DOCUMENT RESUME

ED 249 269                                        TM 840 622

AUTHOR        Hambleton, Ronald K.; Eignor, Daniel R.
TITLE         A Practitioner's Guide to Criterion-Referenced Test
              Development, Validation, and Test Score Usage (Second
              Edition). Laboratory of Psychometric and Evaluation
              Research Report No. 70.
INSTITUTION   Massachusetts Univ., Amherst. Laboratory of
              Psychometric and Evaluative Research.
SPONS AGENCY  National Inst. of Education (DHEW), Washington,
              D.C.
PUB DATE      10 Mar 79
NOTE          550p.
PUB TYPE      Guides - Classroom Use - Guides (For Teachers) (052);
              Tests/Evaluation Instruments (160)

EDRS PRICE    MF02/PC22 Plus Postage.
DESCRIPTORS   *Criterion Referenced Tests; *Cutting Scores;
              *Evaluation Methods; Mastery Tests; Models; Program
              Design; Research and Development; Scoring; *Test
              Construction; *Testing; Test Items; Test Norms; Test
              Reliability; *Test Results; Test Use; Test
              Validity
IDENTIFIERS   *Standard Setting

ABSTRACT 

This  instructional  training  package  introduces 
practitioners to methods for developing, validating, using, and
reporting  criterion-referenced  tests.   It  provides  a  comprehensive 
presentation  of  criterion-referenced  testing  technology.  The  package 
emphasizes  the  most  recent  substantive  and  technological  advances  in 
the  field  that  are  both  important  and  relatively  easy  to  use.  The  10 
units  of  instruction  are:   (1)  "Introduction  to  Criterion-Referenced 
Testing"; (2) "Preparation of Objectives and Test Items"; (3)
"Assessment of Content Validity"; (4) "Test Assembly and
Administration"; (5) "Reliability, Validity and Norms"; (6) "Issues
and  Methods  for  Standard-Setting";   (7)   "Criterion-Referenced  Test  and 
Test  Manual  Evaluations";   (8)  "Use  and  Reporting  of  Test  Score 
Information";   (9)  "Design  of  Criterion-Referenced  Testing 
Programs — Two  Examples";  and  (10)  "New  Developments  and  Areas  for 
Further Research." Each unit is divided into sections: a unit
overview;  an  introduction  to  covered  topics;   relevant  technical 
materials  and  examples;  occasional  optional  materials;  and  cited 
references.  Some  units  have  additional  references  for  further  study. 
Flow-charts, figures, and tables are included whenever possible.
(Author/BS)




A  Practitioner's  Guide  to  Criterion-Referenced  Test 
Development, Validation, and Test Score Usage

(Second  Edition) 


Prepared  By 


Ronald K. Hambleton
University  of  Massachusetts,  Amherst 

and 

Daniel R. Eignor
Educational Testing Service


The project reported herein was supported, in part, by a grant from
the National Institute of Education, Department of Health, Education, and
Welfare. However, the opinions expressed herein do not necessarily reflect
the position or policy of the National Institute of Education and no
official endorsement by the National Institute of Education should be
inferred.

Laboratory of Psychometric and Evaluative Research Report No. 70.
Amherst,  MA:     School  of  Education,  University  of  Massachusetts,  1979. 
(2nd  edition) 


-March  10,  1979- 


Introductory Comments


This  instructional  training  package  was  prepared  to  introduce 
practitioners  to  methods  for  developing  and  validating  criterion- 
referenced  tests  and  to  methods  for  using  and  reporting  criterion- 
referenced test score information. In preparing the document we
attempted to accomplish three goals:

1.  Provide a comprehensive presentation of criterion-referenced
testing  technology; 

2.  Emphasize  the  most  recent  substantive  and  technological 
advances  in  the  field; 

3.  Emphasize criterion-referenced testing contributions that are
both important and relatively easy to use.

Material in the Practitioner's Guidebook is organized into ten
units of instruction. The ten unit titles are:

1.  Introduction to Criterion-Referenced Testing

2.  Preparation  of  Objectives  and  Test  Items 

3.  Assessment  of  Content  Validity 

4.  Test  Assembly  and  Administration 

5.  Reliability,  Validity  and  Norms 

6.  Issues  and  Methods  for  Standard-Setting 

7.  Criterion-Referenced  Test  and  Test  Manual  Evaluations 

8.  Use  and  Reporting  of  Test  Score  Information 

9.  Design  of  Criterion-Referenced  Testing  Programs— Two  Examples 
10.  New  Developments  and  Areas  for  Further  Research 



Each unit is divided into several sections. The first two usually
provide an overview of the unit and an introduction to the topics
which will be covered in the unit. The remaining sections provide, in
a logical sequence, relevant technical material and examples. The
selection of content reflects our bias toward criterion-referenced test
methods which are fairly straightforward to understand, address
satisfactorily the problems at hand, and are relatively easy to apply. In
some instances it was necessary to violate this guideline when a method
meeting our guideline was not available. Also, in a small number of
instances, we included optional material which we felt may be of interest
to practitioners. These sections are marked by an "*". They can be
skipped without any loss in continuity.

The  final  section  of  each  unit  includes  a  list  of  cited  references. 
In  some  units,  we  included  a  second  list  of  references  to  facilitate 
further  study  of  content  introduced  in  the  unit  and/or  special  topics. 
We included flow-charts, figures, and tables whenever possible to improve
the readability of our materials.

Many improvements have been made in the second edition of our
Practitioner's  Guidebook. 

1.  A  slightly  different  model  for  developing  criterion-referenced 
tests  is  proposed. 


2.  Many  new  examples  of  domain  specifications  prepared  by  curriculum 
specialists,  teachers,  and  ourselves  are  included. 

3.  New methods and rating forms are offered for conducting content
validation  studies. 

4.  The  material  on  approaches  for  assessing  test  score  reliability 
is  updated  and  tables  offered  to  facilitate  the  determination 
of  test  length. 




5.  A unit on standard-setting is now available and the material
in it is reflective of current issues and advancements in the
area.

6.  Guidelines and a rating form for evaluating criterion-
referenced tests and test manuals are improved.

The present document is lengthy. Still, we are certain that several
additional topics should be included in a third edition:

1.  We limited our discussion to the construction and uses of
paper and pencil tests. (In a later edition, we plan to add
several units on performance testing.)

2.  The  present  document  includes  no  references  to  the  measurement 
of  affective  outcomes.     (The  interested  reader  is  referred  to 
Popham's  Criterion-Referenced  Measurement,  Prentice-Hall, 
1978.) 


3.  We  plan  to  include  some  case  histories  of  teacher,  school, 
district, and state-wide efforts to produce criterion-
referenced  tests. 

4.  A  unit  on  descriptive  and  inferential  statistics  for  test 
developers will be added. Also, we will include some material
on  Bayesian  statistical  methods  and  decision  theory. 


5.  Methods for studying criterion-referenced test item and test
score bias will be prepared as soon as sufficient information
is available on the topic. For the moment, we recommend
interested  individuals  apply  the  methods  being  advanced  for 
use  with  norm-referenced  tests. 

We  would  like  to  hear  from  individuals  who  use  the  materials. 
Specifically,  we  would  be  interested  in: 

1.  Your  general  comments  about  the  materials; 

2.  Areas  where  you  feel  we  have  been  incomplete,  ambiguous,  or 
inaccurate; 

3.  Your  suggestions  for  expanding  the  material. 

We expect that the Practitioner's Guidebook will be published in
1980. However, until the materials are published, the Northwest Regional
Educational  Laboratory  (in  Portland,  Oregon)  has  agreed  to  distribute  the 
document  through  its  Clearinghouse. 




Unit 1

Introduction  to  Criterion-Referenced  Testing 


Prepared  By 

Ronald K. Hambleton
University  of  Massachusetts,  Amherst 

and 

Daniel R. Eignor
Educational  Testing  Service 


March 15, 1979


Table of Contents

1.0  Overview of the Unit
1.1  Introduction
1.2  Minimum Competency Tests
1.3  Comparison of Norm-Referenced Tests and Criterion-Referenced Tests
1.4  Shortcomings of Norm-Referenced Tests in Program Evaluation
1.5  Developing and Validating Criterion-Referenced Tests
1.6  Using and Reporting Criterion-Referenced Test Score Information
1.7  References




1.0    Overview  of  the  Unit 

This unit was prepared to introduce readers to the topic of
criterion-referenced test technology. Specifically, we will (1) provide some
background, (2) consider the issue of definitions, (3) address the need for
a theory and practice of criterion-referenced tests, (4) compare
norm-referenced tests and criterion-referenced tests, (5) consider shortcomings
of norm-referenced tests, and (6) introduce a framework for studying the
remainder of the units. With regard to point (6), we will introduce a
twelve-step model for developing and validating criterion-referenced tests.
Also, we will introduce a framework for using and reporting
criterion-referenced test score information.




1.1  Introduction 




Glaser (1963) and Popham and Husek (1969) were the first to introduce
and to popularize the field of criterion-referenced testing. Their
motive was to provide the kind of test score information needed to make
a variety of individual and programmatic decisions arising in
objectives-based instructional programs. Norm-referenced tests were seen as less
than ideal for providing the desired kind of test score information.

Presently, there are millions of students at all levels of
education taking criterion-referenced tests. Criterion-referenced tests
are used to monitor individual progress through objectives-based
instructional programs, to diagnose learning deficiencies, to evaluate
educational and social action programs, and to assess competencies on various
certification and licensing examinations. There are many more uses as
well.

Unfortunately, until recently (Hambleton and Eignor,
1978, 1979; Millman, 1974; Popham, 1978a), there have been few
reliable guidelines for test construction, test assessment, and test
score interpretation, and this in turn has hampered effective usage of
criterion-referenced tests. Over the years, standard procedures for
testing and measurement within a norm-referenced framework have become
well-known to educators; however, these procedures are much less
appropriate when the questions being asked concern what examinees can and
cannot do (Glaser, 1963; Hambleton and Novick, 1973; Popham and Husek,
1969). Norm-referenced tests are constructed, principally, to
facilitate the comparison of individuals (or groups) with one another or
with respect to a norm group on the ability measured by a test.



Criterion-referenced tests are constructed to permit the interpretation
of individual (and group) test scores relative to a set of objectives.
Perloff, Perloff, and Sussna (1976) noted "the first recorded instance
of evaluation occurred when man, woman and serpent were punished for
having engaged in acts which apparently had not been among the
objectives defined by the program circumscribing their existence." They
might have added that the "assessment measure" was a criterion-referenced
test. Adam's and Eve's behaviors relative to some stated objectives
were compared to performance standards and found to be deficient.
(Unfortunately, the combined failure of Adam and Eve on that single
criterion-referenced test has had a long range effect on the rest of us.
Fortunately, such long-lasting and far-reaching results of a person's
criterion-referenced test scores are unusual.)

As  an  alternative  to  norm-referenced  tests,  criterion-referenced 
tests were introduced. Criterion-referenced tests are intended to meet
the  testing  and  measurement  requirements  in  objectives-based  instructional 
programs,  competency-based  certification  programs,  and  numerous  other 
situations  where  someone  is  interested  in  the  performance  of  examinees 
relative  to  a  set  of  objectives  or  competencies. 

The last time anyone bothered to count, there were over 600
references on the topic of criterion-referenced testing. Unfortunately,
there are almost as many ideas about what a criterion-referenced test
is as there are contributors to the field. Ross Traub put it best when
he suggested that some new contributions were akin to "stirring muddy
water."



One of the major sources of confusion is over the word "criterion."
For many individuals, it refers to a performance standard, a minimum
proficiency level, or a cut-off score. But it is clear from the two most
influential criterion-referenced testing papers in the 1960's (Glaser,
1963; Popham and Husek, 1969) that these writers used the word "criterion"
to refer to a "domain of behaviors." These authors were interested in
referencing examinee test performance to a well-defined domain of behaviors
measuring an objective or competency. For further discussion of
criterion-referenced test definitions, the interested reader is referred to Donlon
(1974), Hambleton and Novick (1973), Millman (1974), and Popham (1978a).

Popham (1978a) provides the definition we prefer and the one we
will  work  with  in  our  instructional  materials.     It  is: 

A  criterion-referenced  test  is  used  to  ascertain  an 
individual's status [referred to as a domain score]
with  respect  to  a  well-defined  behavior  domain. 

It is often the case that a criterion-referenced test will measure more
than a single objective. If so, items within the test are organized
into non-overlapping subtests corresponding to the objectives measured
by the test. Popham's definition is similar to the one offered by
Millman (1974) and others for a domain-referenced test. The term
"domain-referenced test" is a good one because it is descriptive and
therefore less apt to be misunderstood by practitioners. However, we
agree with W. James Popham (one of the leading contributors to the field
of criterion-referenced testing and a strong advocate for
criterion-referenced testing) on the matter of which "test label" is the most
useful. Presently, there is considerable public support for the term
"criterion-referenced tests," and therefore we feel it would be
unfortunate if a new campaign had to be initiated by educators for the term
"domain-referenced tests." It is true, though, that few so-called
"criterion-referenced tests" could satisfy the demands required by the
criterion-referenced test definition offered above.

Currently, there is confusion over the differences among three
kinds of tests: criterion-referenced tests, domain-referenced tests,
and objectives-referenced tests. If Popham's definition of a
criterion-referenced test is adopted, there is no essential difference between
criterion-referenced tests and domain-referenced tests.
Objectives-referenced tests are tests consisting of items that are matched to
objectives. The primary distinction between criterion-referenced tests
and objectives-referenced tests is as follows: In a criterion-referenced
test, the items are a representative set of items from a clearly defined
domain of behaviors measuring an objective, whereas, with an
objectives-referenced test, no domain of behaviors is specified, and items are not
considered to be representative of any behavior domain. This distinction
has important implications for the kinds of generalizations that can be
made from criterion-referenced test scores as compared to
objectives-referenced test scores and is the reason we prefer Popham's definition.
It is interesting to note that most (if not all) commercially prepared
"criterion-referenced tests" on the market today would be called
"objectives-referenced tests" if Popham's definition for a
criterion-referenced test is adopted.

With the availability of a test theory for norm-referenced
measurements, procedures exist for constructing appropriate measuring instruments,
i.e., norm-referenced tests. Since the primary purposes of norm-referenced
tests and criterion-referenced tests are fundamentally different, it is not
surprising that a different theory and practice of testing is needed
to handle the problem of testing to assess competence. It should be
noted that a norm-referenced test can be used for criterion-referenced
measurement, albeit with some difficulty, since the selection of items
is such that many objectives will very likely not be covered on the
test or, at best, will be covered with only a few items. It has been
noted by at least two writers (Millman, 1974; Traub, 1972) that when
items in a norm-referenced test can be matched to objectives,
criterion-referenced interpretations of the scores are possible, although they
are quite limited in generalizability.

A criterion-referenced test constructed by procedures especially
designed to facilitate criterion-referenced measurement can be and
sometimes is used to make norm-referenced measurements. However, a
criterion-referenced test is not constructed specifically to maximize the
variability of test scores (whereas a norm-referenced test is). Thus, since
the distribution of scores on a criterion-referenced test will tend to
be more homogeneous, it is obvious that such a test will be less useful
for ordering individuals on the measured ability. In summary, a
norm-referenced test can be used to make criterion-referenced measurements,
and a criterion-referenced test can be used to make norm-referenced
measurements, but neither usage will be particularly satisfactory
(Hambleton and Novick, 1973).

It has been argued by some test developers that to refer to tests
as either norm-referenced or criterion-referenced tests may be misleading,
since measurements obtained from either testing instrument can be
given a norm-referenced interpretation, criterion-referenced interpretation,
or both. The important distinction made was that between norm-referenced
measurement and criterion-referenced measurement (Glaser, 1963; Hambleton
and Novick, 1973). From a historical perspective, this distinction is
important since a methodology for constructing criterion-referenced tests
did not exist, at least at the time of Glaser's article.
Criterion-referenced tests were constructed in the same manner as norm-referenced
tests, and as pointed out above, the usage was not satisfactory. However,
in view of recent developments in the field, it would be correct to
refer to a test as either criterion-referenced or norm-referenced. In
fact, given the operational definitions, the distinctions between
criterion-referenced tests and norm-referenced tests are both unambiguous
and meaningful.

Of course, not all educators agree on the usefulness of
criterion-referenced tests (see, for example, the debates between Block [1971] and
Ebel [1971] and between Popham [1978b] and Ebel [1978]). Our position is
that criterion-referenced tests can serve a wide variety of uses, and their
usefulness will be enhanced through knowledge and understanding of
technical developments which address their proper construction,
validation, and usage.



1.2    Minimum  Competency  Tests 

The establishment of minimum competency testing programs in
elementary and secondary schools, and for many professions, has reached
immense proportions (or epidemic proportions, if you view the trend
negatively). For example, well over half (33 to be exact) of the states
have passed legislation requiring assessment of the "competence" of
their elementary and high school students (Pipho, 1978). Further, many
of these states require that students demonstrate at least a minimum
level of performance on a set of competencies in order to receive a
high school graduation diploma. Why are so many state legislatures
mandating minimum competency testing? It appears that it is to
discourage schools from the practice of promoting all students and awarding
high school graduation diplomas based on school attendance only. It is
common for legislators and parents to say that minimum requirements in
the "basic skills" must be set for students to graduate with a diploma
which has some meaning.

The rapidity of change in school, district, and statewide testing
programs and the demand for high quality tests have dictated that
substantial research and development work be undertaken. Included among
the more important research and development topics are: identification
and definition of competencies, management of competency testing
programs, development and validation of competency tests, methods of
determining standards, and uses and interpretations of competency test
scores.




Competency testing technology would be in an embryonic stage
were it not for the work done in developing a criterion-referenced
testing technology since the late 1960's. A competency test is simply
a particular kind of criterion-referenced test and therefore, like a
criterion-referenced test, it must be developed and used in ways
somewhat different from better-known norm-referenced tests. All (or nearly
all) which follows in this Practitioner's Guidebook will apply equally
well to both criterion-referenced tests and what have come to be known
as competency and minimum competency tests.

Perhaps we should begin with a definition of a competency test:

A  competency  test  is  designed  to  determine  an 
examinee's  level  of  performance  relative  to 
each competency being measured. Each competency
is described by a well-defined behavior
domain  (Hambleton  and  Eignor,  1979). 

Clearly, competency tests and criterion-referenced tests are
equivalent. Perhaps the only differences are the contexts in which the
terms are used and the characteristics measured by the tests. However,
there is no need for two expressions. The definition makes clear that
the purpose of a competency test is to provide information about an
individual examinee's level of performance on each competency which is
measured by a test. There will be as many test scores as there are
competencies measured by a test. Also, competencies are clearly written
so that there will be a high level of agreement among users of the test
about the content (behaviors) defining the competency. This
desirable goal can be accomplished through the use of "domain
specifications" (Popham, 1978a). This term will be described in more detail later.
There is one other point. There is nothing inherent in the definition of
a competency test which requires test scores to be compared to "standards."
In fact, the percentage scores (reported by competency) provide excellent
descriptive information about examinee performance. Since it is common,
however, to interpret examinee test performance relative to standards
(an examinee who scores equal to or above a standard set at 70% [say]
on the set of test items included in a competency test is described
as a "master" or "competent"), it is necessary to introduce a new term,
"minimum competency testing."

A minimum competency test is designed to determine
whether an examinee has reached a prespecified level
of performance relative to each competency being
measured. The "prespecified level" or "standard"
may vary from one competency to the next. Also,
each competency is described by a well-defined
behavior domain.

A "standard" (sometimes it is called a "cut-off score" or a "minimum
proficiency level") is a point on a test score scale which is used to
separate examinees into two categories, each reflecting a different level
of proficiency relative to the competency measured by the test under
consideration. It is common to assign labels such as "master" or
"competent" to those persons in the higher-scoring category and
"non-master" or "incompetent" to those persons in the lower-scoring category.
Note that if a test measures more than a single competency and if examinees
are to be classified into competency categories based on their performance
on each set of items measuring a competency, as is often the case, a
standard is set for each competency measured by the test. There can be
as many competency decisions as there are competencies measured by the
test.
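
The classification rule described above can be illustrated with a minimal sketch. It assumes that items are scored 0 or 1, that scores are reported as percent correct by competency, and that an examinee scoring equal to or above the standard is labeled a "master." The competency names, response data, standards, and function names are hypothetical and serve only as illustration.

```python
from typing import Dict, List


def percent_correct(item_scores: List[int]) -> float:
    """Percentage of items answered correctly (items scored 0 or 1)."""
    return 100.0 * sum(item_scores) / len(item_scores)


def classify_by_competency(
    responses: Dict[str, List[int]],
    standards: Dict[str, float],
) -> Dict[str, str]:
    """Make one mastery decision per competency.

    Each competency has its own set of items and its own standard,
    so there are as many decisions as there are competencies.
    """
    decisions = {}
    for competency, item_scores in responses.items():
        score = percent_correct(item_scores)
        # Scores equal to or above the standard count as mastery.
        decisions[competency] = (
            "master" if score >= standards[competency] else "non-master"
        )
    return decisions


# Hypothetical examinee: two competencies, ten items each.
responses = {
    "fractions": [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],  # 8 of 10 correct
    "decimals": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],   # 6 of 10 correct
}
# Hypothetical standards; they may differ from one competency to the next.
standards = {"fractions": 80.0, "decimals": 70.0}

decisions = classify_by_competency(responses, standards)
print(decisions)
```

Keeping the percent scores and the standards in separate structures reflects the point made in the text: the scores are descriptive on their own, and the mastery labels appear only once standards are introduced.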


From the definitions above, it is clear that minimum competency
tests are a special type of competency test (tests where standards are
introduced to interpret examinee performance) and, as we mentioned
earlier, competency tests are a special type of criterion-referenced
test (i.e., those tests which are usually used in certification and
licensing situations).




1.3    Comparison of Norm-Referenced Tests
and  Criterion-Referenced  Tests 

Similarities and differences between norm-referenced tests and
criterion-referenced tests are summarized below under nine topics.


Purpose 

A  norm-referenced  test  is  designed  to  facilitate  comparisons 
among  examinees  on  the  ability  being  measured. 

A  criterion-referenced  test  is  designed  to  assess  an  examinee's 
level  of  performance  relative  to  a  well-defined  behavior  domain. 


Test Development Method

For  a  norm-referenced  test,  a  test  blueprint  is  prepared  and 
items  are  written  according  to  the  blueprint.    An  important 
factor  in  item  selection  is  the  statistical  properties  of  the 
test  items  (item  difficulty  and  discrimination) .    In  general, 
items  of  moderate  difficulty  (p-values  in  the  range  .30  to  .70) 
and  high  discriminating  power  (point  biserial  correlations 
over  .30)  are  the  most  likely  to  be  selected  for  inclusion 
in  a  test  because  they  contribute  substantially  to  test  score 
variance.    Test  reliability  and  validity  will,  generally, 
be  higher  when  test  score  variance  is  high. 
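
The item statistics mentioned above can be sketched as follows, assuming items are scored 0 or 1. The sketch computes each item's p-value (proportion correct) and its point-biserial correlation with the total test score, then flags items falling in the ranges cited in the text. The response data and function names are hypothetical, and the total score used here includes the item itself, which is a simplification.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """p-value: proportion of examinees answering the item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total test score."""
    p = mean(item_scores)
    sd_total = pstdev(total_scores)
    if p in (0, 1) or sd_total == 0:
        return 0.0  # undefined without variance; treated as 0 in this sketch
    mean_correct = mean(
        [t for i, t in zip(item_scores, total_scores) if i == 1]
    )
    # r_pb = (M_correct - M_all) / SD_total * sqrt(p / (1 - p))
    return (mean_correct - mean(total_scores)) / sd_total * (p / (1 - p)) ** 0.5


def flag_items(responses, p_range=(0.30, 0.70), min_r=0.30):
    """Return indices of items with moderate difficulty and high discrimination."""
    totals = [sum(row) for row in responses]
    flagged = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        p = item_difficulty(item)
        r = point_biserial(item, totals)
        if p_range[0] <= p <= p_range[1] and r > min_r:
            flagged.append(j)
    return flagged


# Hypothetical data: rows are examinees, columns are items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
flagged = flag_items(responses)
print(flagged)
```

In this small data set only the middle item has a p-value inside the .30 to .70 range, so it is the only one flagged, even though all three items discriminate well.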

For a criterion-referenced test, domain specifications are
prepared and items written to measure the domain specifications.
Test items are selected for a criterion-referenced test if
they are "reflective" of the domain they were written to measure
and if they can serve as a "representative" set of test items
defined by the domain specification. (A domain specification
represents an attempt to clearly define the behavior domain
associated with a particular objective or competency.)


Measurement  Scales 

The  norm-referenced  test  score  scale  is  anchored  in  the 
middle  (the  average  level  of  group  performance). 

For  criterion-referenced  test  score  scales,  the  anchor  points 
are  two  in  number  and  located  at  the  ends  of  the  scale  (0% 
and  100%). 
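
The difference between the two scales can be illustrated with a short sketch, under the assumption that the norm-referenced scale is expressed as a standard (z) score relative to the group mean and the criterion-referenced scale as percent correct. The raw scores, the test length, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """Norm-referenced view: distance from the group mean in SD units."""
    group_mean = mean(raw_scores)
    group_sd = pstdev(raw_scores)
    return [(x - group_mean) / group_sd for x in raw_scores]


def percent_scores(raw_scores, n_items):
    """Criterion-referenced view: position between 0% and 100% correct."""
    return [100.0 * x / n_items for x in raw_scores]


N_ITEMS = 20                       # hypothetical test length
group_a = [12, 14, 16, 18, 20]     # hypothetical raw scores, group A
group_b = [16, 17, 18, 19, 20]     # hypothetical raw scores, group B

z_a = z_scores(group_a)
z_b = z_scores(group_b)
pct_a = percent_scores(group_a, N_ITEMS)
pct_b = percent_scores(group_b, N_ITEMS)

# A raw score of 16 sits at the mean of group A but below the mean of
# group B, while its percent-correct value is the same in both groups.
print(z_a[2], z_b[0], pct_a[2], pct_b[0])
```

The z score changes with the comparison group because its anchor is the group average; the percent score does not, because its anchors are the fixed end points of the scale.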


Test  Score  Uses 


Norm-referenced  test  scores  are  often  used  to  make  comparisons 
among  examinees  or  to  handle  "fixed  quota11  selection  problems 

19 


-13- 


(i.e.,  the  problem  when  there  is  a  fixed  number  of  "vacancies" 
and  the  number  of  applicants. exceeds  the  number  of  vacancies). 

Criterion-referenced test scores are used (1) to make descriptive
statements about what examinees can do, (2) to make
instructional decisions, and (3) to evaluate programs and
their effectiveness. Examinees are judged primarily on their
own merits. There are instances where examinees (or groups)
may be compared with one another, but this is not a primary
use of the scores. Criterion-referenced tests are often used
in "quota-free" selection problems (i.e., situations when no
limits are placed on the number of examinees receiving a
"passing score").

An important point to note is that both norm-referenced tests
and criterion-referenced tests "sort" individuals. However,
norm-referenced tests are used to sort examinees according to
their performance on the test and criterion-referenced tests
are used to sort examinees into groups according to their
mastery or non-mastery of skills measured by a test.


Test  Score  Generalizability 

There is seldom interest in making generalizations from norm-referenced
test scores. Usually, the job is completed when
test scores are compared to appropriate norms tables.

With criterion-referenced test scores, the matter of generalizability
is important. Seldom would anyone be content to interpret
an examinee's score in terms of the specific items on a
test. (Incidentally, this is all that can be appropriately
done with scores obtained from objectives-referenced tests.)
If the objective measured by a test is clear, and if items
are selected to be representative of the behavior domain
defining the objective, examinee test performance on a set
of items included in the test can be generalized to test
performance in the larger domain of behaviors. Strong
criterion-referenced test score interpretations of the kind
just described are usually of interest to criterion-referenced
test users. (So much so that they usually make a strong
interpretation whether justified or not!)




Specificity  of  Test  Score  Information 

A  norm-referenced  test  provides  a  summary  of  a  somewhat  abstract 
area  of  achievement. 


A  criterion-referenced  test  provides  very  specific  and  detailed 
information  about  a  clearly  defined  area  of  achievement. 


Instructional Applications

Users of norm-referenced tests espouse the view that learning
is a complex process consisting of concepts and relationships
organized in a hierarchical arrangement.

Users of criterion-referenced tests are endorsing the notion
that things learned can be separated into discrete categories
(referred to as objectives, skills, or competencies).


Reliability and Validity Issues

For both types of tests, reliability and validity considerations
are important. However, since test score reliability and
validity are also specific to the intended uses of the scores,
and since norm-referenced test scores and criterion-referenced
test scores are used to address different types of problems,
it is not surprising that approaches for assessing reliability
and validity will differ with each type of test.


Norms

Norms tables are of central importance with norm-referenced
tests. Norms can also be of value when interpreting individual
and group criterion-referenced test scores.



1.4    Shortcomings of Norm-Referenced
Tests in Program Evaluation¹

Evaluators are confronted with questions such as, "What is the
average level of performance on a particular set of mathematics objectives
for a specified group of individuals?" or "How much has a
particular group learned from a special reading program?" A common
question asked of Title I program evaluators is "Have X% of a group of
participating students 'mastered' over Y% of the reading program objectives?"
It is common practice for individuals presented with these
questions to turn to one of Oscar Buros' Mental Measurements Yearbooks
(the eighth edition was published in 1978) and search for a suitable
assessment instrument. But the search is likely to be a long and frustrating
one. The great majority (probably over 95%) of the instruments
found in the Yearbooks are norm-referenced instruments. That is, the
instruments were designed primarily to permit comparisons of one individual
with another on the construct or ability measured by the instrument.
Grade-equivalent scores, age-equivalent scores, percentile ranks, and
standard scores are common ways of reporting individual test performance.
All reporting methods permit comparisons among individuals, but they
provide little or no information relative to important questions such as,
"What can an individual (or group) do?"

Because they are used so frequently, you might think (or be
tempted to conclude) that norm-referenced tests provide excellent indicators
of program effectiveness. Federal agencies, school boards,
program evaluators, and even parent groups often request that they be
administered. Why? Well, certainly it could be argued that they are objective,
and usually norm-referenced tests are developed and distributed
by companies with years of experience in the area of testing. Also,
the costs are usually low (at least when compared to the costs of a
program developing and validating its own evaluative instrument). But,
and this is the critical point, it is very difficult to purchase a
norm-referenced test where the content of the test will closely approximate
the objectives or goals of some specific program.

¹From a paper by Hambleton, R. K., & Gifford, J. C. Development and
use of criterion-referenced tests to evaluate program effectiveness.
Laboratory of Psychometric and Evaluative Research Report No. 52. Amherst,
MA: School of Education, University of Massachusetts, 1977.

Let us back up now and look more closely at the construction and
uses of norm-referenced tests. A norm-referenced test is a test that
is designed to facilitate the comparison of individuals with respect
to one another, or to some appropriately chosen norm group. Since the major
purpose of norm-referenced tests is to facilitate comparisons, the
information obtained from a norm-referenced test is information concerning
an individual and is best used to make decisions about individuals.
Consequently, norm-referenced tests are often used for selecting, placing,
and counseling individuals. In order for a norm-referenced test to be
effective in making meaningful comparisons, it is important that there
be variability in individual responses. This is accomplished by
selecting test items from a larger pool of available test items
because they have moderate item difficulty levels (typically
in the range .30 to .70) and moderate to high discrimination indices
(point biserial correlations over about .30). Other things being equal,
items so selected will tend to maximize test score variability. Variability
spreads the examinees over the ability scale and allows the
user to make meaningful comparative statements about an individual in
terms of the group as a whole. Without variability, all examinees
would be receiving the same scores and no useful information would be
obtained. In norm-referenced testing, all meaning is dependent entirely
on  the  comparison  of  the  individual  to  others. 
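
The item-selection rule described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the small response matrix `demo_responses` are hypothetical, and the point-biserial is the uncorrected one (each item is correlated with a total score that includes the item itself). The cutoffs are the typical values quoted in the text (difficulty between .30 and .70, point biserial over .30).

```python
import math


def p_value(item_scores):
    """Item difficulty: proportion of examinees answering correctly."""
    return sum(item_scores) / len(item_scores)


def pearson(x, y):
    """Pearson correlation; with a 0/1 item this is the point biserial."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # No variance (e.g., every examinee passed the item): treat as 0.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_items(responses, p_low=0.30, p_high=0.70, min_rpb=0.30):
    """Return indices of items meeting the norm-referenced selection rule."""
    totals = [sum(row) for row in responses]
    selected = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        if p_low <= p_value(item) <= p_high and pearson(item, totals) > min_rpb:
            selected.append(j)
    return selected


# Hypothetical data: rows are examinees, columns are items (1 = correct).
demo_responses = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
]
print(select_items(demo_responses))
```

In this made-up data set every examinee answers item 0 correctly, so its p-value of 1.00 falls outside the difficulty range and the item is not selected, even though it may reflect something that was taught well.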

The most commonly used norm-referenced tests are those prepared
by publishers to be used extensively throughout the country. Test items
are chosen to measure content or goals common to a wide sample and variety
of programs. The norm-referenced test provides an overview of an
individual's relative ability in a rather broad content area. The
scores are reported as raw scores and one or more derived scores (for
example, percentiles, age- or grade-equivalent scores, and standard
scores). The raw scores alone have very little meaning. Inferences
cannot be made as to what the student knows or does not know. The derived
scores give specific information concerning the relation of an individual's
knowledge or ability to that of a particular reference group.

With  the  increase  of  concern  for  accountability  and  the  efficient 
use of tax dollars, program evaluation has taken a position of tremendous
importance. Government agencies, for example, need the kind of
information  that  will  enable  them  to  make  the  most  effective  decisions. 
One  of  the  most  commonly  used  measures  of  program  effectiveness  is  the 
student  test  score,  in  particular,  the  norm-referenced  standardized  test 
score.    If  this  is  to  be  the  case,  it  is  crucial  that  the  tools  chosen 
for  evaluation  be  ideally  suited  for  answering  the  questions  an  evaluator 
asks.    For  several  reasons  norm-referenced  tests  are  not  well-suited  for 
the  measurement  of  program  effectiveness.    However,  there  are  others 
who  take  a  different  view  (Ebel,  1971,  1978). 


One shortcoming of norm-referenced tests in program evaluation
is the discrepancy between the content covered by a test and the
content of a program that is being evaluated. Reasons for the discrepancy
relate to the basic construction and the misuse of norm-referenced
tests. The tests that are most commonly used in evaluations
are used nationwide and are based on an amalgamation of objectives of programs
from all over the country. Each program has different objectives
and different times when the instruction of particular objectives occurs.
The overlap of program objectives and test objectives will not be complete,
and the degree of overlap will change from program to program. This is
particularly true in compensatory educational programs, where the objectives
may be more basic and specific than the general objectives
reflected in norm-referenced tests.

It is often hard to find a standardized achievement measure where
the content covered by the measure closely matches the content goals of
a particular program being evaluated. Therefore, any evidence from the
achievement measure can always be discarded. When the match between test
content and program content is low, we have nothing of value. Since
each program curriculum typically reflects the people teaching the program
and their priorities and emphases, this "mismatch" is commonly
encountered in program evaluation studies.

A second cause of the discrepancy between test content and program
objectives arises directly from a major purpose of norm-referenced
tests, i.e., to compare an individual's performance, knowledge, or skill
to that of some reference group. In order to effectively obtain this
type of information from a test, the test must be constructed with that
purpose in mind. Consequently, norm-referenced tests consist of test
items that contribute most to maximizing test score variability. In the
process of choosing items that contribute variability, those
contributing low variability are eliminated. It is clear that
items tapping concepts taught successfully by a great number of teachers
will contribute little to test score variability (most students will
answer the items correctly) and will be eliminated, while the items
measuring pure reasoning ability will have greater variability and will
be retained. As a result of the process, the test begins to look less
like an achievement test and more like an aptitude test. The process
of item selection puts a distance between the curriculum of the educational
program and the tool used to evaluate it. The test would be
sensitive to the aptitude of the individuals rather than the effectiveness
of the instruction. If an instrument is to be sensitive to the
learning process, its content must be very carefully matched to that of
the program. It is being said more and more that norm-referenced tests
function like IQ tests. Test items where performance is high (perhaps
reflecting areas of successful teaching) are typically removed because
they fail to discriminate. In other words, many school-related skills
are systematically eliminated. What we are left with is variation due
to  the  effects  of  "non-school-related  variables." 
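
A short calculation shows why items that most students answer correctly add so little to test score variability. For an item scored 0 or 1, the variance of the item scores is p(1 - p), where p is the proportion answering correctly. The sketch below is an illustration only; the function name and the three p-values are hypothetical.

```python
def item_variance(p):
    """Variance of a 0/1 item answered correctly by proportion p."""
    return p * (1.0 - p)


# Compare a moderately difficult item with two easier ones.
for p in (0.50, 0.70, 0.95):
    print(f"p = {p:.2f}   item variance = {item_variance(p):.4f}")
```

The variance is largest at p = .50 and shrinks toward zero as p approaches 1.00, so an item covering well-taught content is exactly the kind of item a variance-maximizing selection rule discards.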

Presently, many of the programs to be evaluated are innovative.
Not only are the instructional methods different, but often the goals
and objectives of the program are different from those of the traditional
program. As a result, a score doesn't represent knowledge
in terms of the instruction. It is a mistake to judge an innovative
program according to the standards of a traditional program. The effectiveness
of a program cannot be measured by a tool that has been
developed to measure something else.

Other problems with norm-referenced tests result not from the
basic construction but from the use of the tests. In many cases, the
program to be evaluated deals with a population that is not reflected
by the norm group of the test. This has implications for
the interpretation of scores for many types of compensatory and
special educational programs.

There is yet another problem. Standardized or published norm-referenced
tests can be criticized also on the grounds that they are
too general. Of course, they have this feature to give them broad
appeal, but the more general the test, the easier it is for people to
see what they want in the results.



1.5    Developing and Validating
Criterion-Referenced Tests

Figure 1.5.1 from Hambleton and Eignor (1979) presents a twelve
step model for developing and validating criterion-referenced tests.
The importance of each step in the model depends upon the size and
scope of the test development and validation project. An agency with
the responsibility of producing a criterion-referenced test for statewide
use will proceed through the steps in a rather different fashion
than will a small consulting firm or a teacher producing a classroom
test on a very limited budget.

In brief, the twelve steps are as follows:

Step 1--Objectives must be prepared or selected before the
test development process can begin.

Step 2--Test specifications are needed to clarify the test's
purposes, desirable item formats, number of test items,
instructions to item writers, etc.

Step 3--Items are prepared to measure objectives included in
the test (or tests, if there are going to be parallel
forms or levels of a test varying in difficulty).

Step 4--Initial editing of items is completed by the individuals
writing them.

Step 5--A systematic assessment of items prepared in steps 2 and
3 is conducted to determine the item validities. Essentially,
the task is to determine the content validity
of the test items.

Step 6--Based on the data from step 5, it is possible to do
further item editing, and in some instances, discard
items that do not adequately measure the objectives
they were written to measure.

Step 7--The test (or tests) must be assembled.

Step 8--A method for setting standards to interpret examinee
test performance is selected and implemented.

Step 9--The test (or tests) can be administered.

Step 10--Data addressing reliability, validity, and norms
should be collected and analyzed.

Step 11--A user's manual and a technical manual should be
prepared.

Step 12--This step is included to reinforce the point that it
is necessary, in an on-going way, to compile technical
data on the test items and tests as they are used in
different situations with different examinee populations.

The next six units (units two to seven) will provide details for
successfully completing each of the steps in the test development
and validation model presented in Figure 1.5.1. The chart below summarizes
the location of instructional material on each of the twelve steps in
the units of instruction which follow:


Steps          Unit

1, 2, 3, 4     2
5, 6           3
7, 9           4
8              6
10             5
11, 12         7



1.  Writing and/or Selection of Objectives

2.  Preparation of Test Specifications (for example, Available Time,
Selection of Objectives to be Measured by a Test, Number of Test
Items/Objective, Appropriate Vocabulary, Method of Scoring)

3.  Writing Test Items "Matched" to Objectives

4.  Preliminary Review of Test Items

5.  Determination of Content Validity of the Test Items

(a)  Involvement of Content Specialists

(b)  Collection and Analysis of Examinee Item Response Data

6.  Additional Editing of Test Items

7.  Test Assembly

(a)  Determination of Number of Test Items/Objective

(b)  Test Item Selection

(c)  Preparation of Directions and Sample Questions

(d)  Layout and Test Booklet Preparation

(e)  Preparation of Scoring Keys

(f)  Preparation of Answer Sheets

8.  Standard Setting for Interpreting Examinee Test Performance

9.  Test Administration

10.  Assessment of Test Score Reliability and Validity; Compilation of
Test Score Norms (Optional)

11.  Preparation of a User's Manual and a Technical Manual

12.  Periodic Collection of Additional Technical Information

Figure 1.5.1.     Steps for Developing and Validating Criterion-
Referenced Tests.



1.6    Using and Reporting Criterion-
Referenced Test Score Information

Figure 1.6.1 outlines the content of Unit 8. Specifically,
we will consider two primary uses of criterion-referenced test
scores: (1) domain-score estimation, and (2) mastery status determination.
We  also  discuss,  but  to  a  lesser  extent,  the  use  of  criterion-referenced 
test  scores  for  program  evaluation.    For  each  use,  we  will  consider 
appropriate  methods  for  applying  criterion-referenced  tests.  Finally, 
we  consider  a  number  of  ways  of  reporting  test  score  information,  and  we 
consider  the  use  of  criterion-referenced  tests  for  grading  purposes. 
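
To make the two primary uses concrete, here is a minimal Python sketch. It assumes the simplest possible choices, which are ours and not necessarily those recommended in Unit 8: the domain score is estimated by the proportion of items answered correctly, its standard error uses the binomial formula (which presumes the items are a random sample from the domain), and mastery is decided by comparing the estimate with a hypothetical cutoff of .80.

```python
import math


def estimate_domain_score(item_scores):
    """Proportion-correct estimate of the domain score and its binomial
    standard error, for a list of 0/1 item scores."""
    n = len(item_scores)
    p_hat = sum(item_scores) / n
    standard_error = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, standard_error


def mastery_status(item_scores, cutoff=0.80):
    """Classify an examinee by comparing the estimate with a cutoff.
    The cutoff value is hypothetical; standard setting is a separate step."""
    p_hat, _ = estimate_domain_score(item_scores)
    return "master" if p_hat >= cutoff else "non-master"


# Hypothetical examinee: 9 of 10 items correct on one objective.
scores = [1] * 9 + [0]
p_hat, se = estimate_domain_score(scores)
print(f"domain score estimate = {p_hat:.2f}, standard error = {se:.3f}")
print(mastery_status(scores))
```

With only ten items the standard error is large relative to the score scale, which is one reason the choice of estimation method and decision model receives separate treatment in Unit 8.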




1.  Uses of Criterion-Referenced Tests

2.  Domain-Score Estimation

(a)  Selection of an Estimation Method

3.  Mastery Status Determination

(a)  Selection of a Decision Model

(b)  Loss Specification

4.  Reporting of the Information

(a)  Individual Level

(b)  Group Level

(c)  Program Evaluation

5.  Criterion-Referenced Grading

Figure 1.6.1.    Using and Reporting Criterion-Referenced
Test Score Information




1.7  References 




Block,  J.  H.    Criterion-referenced  measurements:    Potential.  School 
Review,  1971,  69,  289-298. 

Donlon,  T.  F.    Some  needs  for  clearer  terminology  in  criterion- 
referenced  testing.    Paper  presented  at  the  annual  meeting  of 
the  National  Council  on  Measurement  in  Education,  Chicago, 
1974. 

Ebel,  R.  L.    Criterion-referenced  measurements:    Limitations.  School 
Review,  1971,  69,  282-288. 

Ebel,  R.  L.    The  case  for  norm-referenced  measurements.  Educational 
Researcher,  1978,  15,  321-327. 

Glaser, R.    Instructional technology and the measurement of learning
outcomes.    American Psychologist, 1963, 18, 519-521.

Glaser, R., & Nitko, A. J.    Measurement in learning and instruction.
In R. L. Thorndike (Ed.), Educational measurement.     (2nd ed.)
Washington:    American Council on Education, 1971.

Hambleton, R. K., & Eignor, D. R.    Guidelines for evaluating criterion-
referenced tests and test manuals.    Journal of Educational
Measurement, 1978, 15, 321-327.

Hambleton, R. K., & Eignor, D. R.    Competency test development,
validation, and standard-setting.     In R. Jaeger & C. Tittle
(Eds.), Minimum competency testing.     (Approx. Title) Berkeley,
California:    McCutchan Publishing Co., 1979.


Hambleton, R. K., & Novick, M. R.    Toward an integration of theory and
method for criterion-referenced tests.    Journal of Educational
Measurement, 1973, 10, 159-170.

Harris, C. W., Alkin, M. C., & Popham, W. J.    Problems in criterion-
referenced measurement.    CSE monograph series in evaluation,
No. 3.    Los Angeles:    Center for the Study of Evaluation,
University of California, 1974.

Millman, J.     Criterion-referenced measurement.     In W. J. Popham (Ed.),
Evaluation in education:    Current applications.  Berkeley,
California:    McCutchan Publishing Co., 1974.



Perloff, R., Perloff, E., & Sussna, E.    Program evaluation.  Annual
Review of Psychology, 1976, 27, 569-594.

Pipho, C.    Minimum competency testing in 1978:    A look at state
standards.    Phi Delta Kappan, 1978, 59, No. 9 (May), 585-587.

Popham, W. J.    Criterion-referenced measurement.    Englewood Cliffs, NJ:
Prentice-Hall, 1978. (a)

Popham, W. J.    The case for criterion-referenced measurements.  Educational
Researcher, 1978, 7, 6-10. (b)

Popham, W. J., & Husek, T. R.     Implications of criterion-referenced
measurement.    Journal of Educational Measurement, 1969, 6, 1-9.

Traub, R. E.    Criterion-referenced measurement:    Something old and
something new.    A paper prepared for an invited public address
at the University of Victoria, 1972.


34 


Unit  2 


Preparation  of  Objectives  and  Test  Items 


Prepared  By 

Ronald  K.  Hambleton 
University  of  Massachusetts,  Amherst 

and 

Daniel  R.  Eignor 
Educational  Testing  Service 


March  15,  1979 




Table of Contents

2.0  Overview of the Unit

2.1  Introduction

2.2  Development of Objectives

2.3  Domain Specifications

2.4  Examples of Domain Specifications

2.5  Item Forms Analysis

2.6  Examples of Item Forms Analysis

2.7  Flowchart of the Process of Developing Objectives

2.8  Objective Banks

2.9  Preparation of Test Specifications

2.10  Preparation of Test Items

2.11  Editing Test Items

2.12  References

2.12.1  References Cited

2.12.2  References for Further Study

2.12.3  Measurement and Evaluation Textbooks




2.0    Overview  of  the  Unit 

This  unit  covers  the  first  four  steps  of  the  Criterion-Referenced 
Test  Development  and  Validation  Model  presented  in  Unit  1.    These  steps 
are:

1.  Writing and/or Selection of Objectives

2.  Preparation of Test Specifications

3.  Writing  Test  Items  "Matched"  to  Objectives 

4.  Preliminary  Review  of  Test  Items 




2.1    Introduction

In Unit 2, we will discuss both research and procedures directed
toward the preparation of objectives and test items for criterion-
referenced tests.    Before offering a relevant set of procedures for the
development of objectives and items, it is necessary to introduce and
discuss some important background information that will help the reader
better understand the interrelationships among the procedures to be
discussed.

Popham's (1978a) definition of a criterion-referenced test is an
excellent starting point for this discussion.     It is as follows:


A  criterion-referenced  test  is  used  to  ascertain 
an  individual's  status  [referred  to  as  a  domain 
score]  with  respect  to  a  well-defined  behavior 
domain. 


Once the domain of relevant behaviors has been defined, test items are
written to measure behaviors in the domain, and a test is formed. From
the test results, a test practitioner usually desires to make an inference
about an examinee's level of performance relative to the domain of behaviors.
A valid inference can be made if two conditions¹ are met:

1.  The behavior domain is clearly and completely specified.

2.  A random sample (or stratified random sample) of tasks
from the domain is measured by the test.

¹Actually, if the test items have been calibrated using one of the
many latent trait models, the second condition need not be satisfied
(Hambleton, 1979).

In attempting to achieve these two conditions, several problems arise.
From a technical standpoint, for a random sample to be taken (condition 2),
it is not enough for the domain to be well-defined; it must also be
completely defined. For certain subject domains, this complete definition
of the domain may be impossible, and a compromise must be reached.
Weaker  inferential  procedures  must  be  considered. 
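
When the domain can be listed completely, the second condition (a random or stratified random sample of tasks) is straightforward to carry out. The Python sketch below is only an illustration. The domain (all sums of two single-digit numbers), the two strata (sums below 10 versus sums of 10 or more), and the function names are hypothetical choices made for the example.

```python
import random


def stratified_sample(domain, stratum_of, per_stratum, seed=None):
    """Draw a simple random sample of `per_stratum` tasks from each stratum
    of a fully enumerated domain."""
    rng = random.Random(seed)
    strata = {}
    for task in domain:
        strata.setdefault(stratum_of(task), []).append(task)
    sample = []
    for key in sorted(strata):
        tasks = strata[key]
        # Never ask for more tasks than the stratum contains.
        sample.extend(rng.sample(tasks, min(per_stratum, len(tasks))))
    return sample


def sum_stratum(task):
    """Hypothetical stratification: does the sum reach two digits?"""
    a, b = task
    return "sum of 10 or more" if a + b >= 10 else "sum below 10"


# The complete domain: every ordered pair of single-digit addends.
addition_domain = [(a, b) for a in range(10) for b in range(10)]

test_tasks = stratified_sample(addition_domain, sum_stratum, per_stratum=5, seed=1)
print(test_tasks)
```

The sketch works only because `addition_domain` enumerates every task. For a domain that cannot be listed in this way there is nothing to draw from at random, which is the difficulty the paragraph above describes.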

This issue of domain specification was described by Traub (1975)
as the concern for domain sampling validity. Domain sampling validity
concerns the adequacy of the tasks contained in a test as a sample of the
tasks in the whole domain. According to Traub:

It is the kind of validity that establishes the basis
for one kind of inference from observed level of performance
on a test to probable level of performance on
all tasks contained in the domain.

Traub further distinguishes two varieties of domain sampling validity,
and these two varieties are directly related to the procedures discussed
in this unit. Strong domain sampling validity can occur when the domain
is explicitly and completely defined; weak domain sampling validity occurs
when the domain of tasks cannot be defined explicitly, and so must be
defined implicitly. A domain of tasks is explicit if all the tasks in the
domain are known; the description is implicit if not all the tasks are
known, but a clear enough description of the domain exists to see how the
tasks should arise. The most frequently used approach to forming explicit
descriptions is to use item generation forms (the procedure
is called item forms analysis), which is discussed in section 2.6.

A  frequently  used  approach  based  upon  an  implicit  description  of  the 
domain  is  through  the  use  of  domain  specifications,  a  topic  which  will  be 
discussed  in  section  2.3. 
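
As a rough illustration of what an explicit description looks like, the Python sketch below treats an item generation form as a fixed template plus the sets of values that may fill it, so that every item in the domain can be listed. The structure of `ITEM_FORM`, the two layouts, and the function name are hypothetical and are not taken from the item forms presented later in this unit.

```python
from itertools import product

# Hypothetical item form for adding two single-digit numbers.
ITEM_FORM = {
    "layouts": {
        "horizontal": "{a} + {b} = ___",
        "vertical": "  {a}\n+ {b}\n----",
    },
    "a": range(0, 10),  # replacement set for the first addend
    "b": range(0, 10),  # replacement set for the second addend
}


def generate_items(form):
    """List every item the form can produce, with its scoring key."""
    items = []
    for layout, a, b in product(form["layouts"], form["a"], form["b"]):
        items.append({
            "layout": layout,
            "stem": form["layouts"][layout].format(a=a, b=b),
            "key": a + b,
        })
    return items


all_items = generate_items(ITEM_FORM)
print(len(all_items), "items in the domain")
print(all_items[0]["stem"])
```

Because the form fixes the template and the replacement sets, the domain here has exactly 2 x 10 x 10 = 200 items, and any one of them can be drawn at random.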

Strong domain sampling validity will be difficult to achieve,
except in highly structured content areas such as mathematics.
Usually behavior domains cannot be explicitly defined. How can
weak domain sampling validity be assessed in these situations? Cronbach
(1971) and Cronbach et al. (1972) require that, if a domain cannot be
explicitly defined, the implicit definition be clear enough so
that ". . . qualified judges can agree as to whether any particular test
item is included in the definition or ruled out by it." To be able to
generalize from the sample of test items to the domain when the domain is
implicitly defined requires a replication of the item writing task by
different content specialists and a comparison of examinee scores on the
two forms of the test constructed by the two groups of content specialists
working independently with the same set of test specifications.
The use of content specialists in establishing the validity of test items
will  be  discussed  in  Unit  3. 

In summary, the test constructor wants the test to be a random
sample of the behaviors included in a domain so that he/she can make an
inference about examinee performance on the domain, based on the test results.
When the domain can be explicitly defined, such as through the use
of item generation forms, random sampling can occur, and the desired
strong inference can be made. When the domain is implicitly defined,
such as through the use of domain specification procedures, representative
sampling can occur, but a somewhat more limited inference can be made.
Replication  can,  however,  improve  the  strength  of  the  inference. 
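
One simple way to carry out the comparison of examinee scores on two independently constructed forms is sketched below in Python. The choice of statistics (the mean difference in proportion-correct scores and the proportion of examinees classified the same way on both forms), the cutoff of .70, and the score lists are all assumptions made for illustration; the text does not prescribe a particular comparison.

```python
def compare_forms(form_a_scores, form_b_scores, cutoff):
    """Compare proportion-correct scores of the same examinees on two forms.

    Returns the mean score difference (form A minus form B) and the
    proportion of examinees given the same mastery decision on both forms.
    """
    n = len(form_a_scores)
    pairs = list(zip(form_a_scores, form_b_scores))
    mean_difference = sum(a - b for a, b in pairs) / n
    same_decision = sum((a >= cutoff) == (b >= cutoff) for a, b in pairs)
    return mean_difference, same_decision / n


# Hypothetical scores for five examinees on the two forms.
form_a = [0.90, 0.80, 0.60, 0.50, 0.95]
form_b = [0.85, 0.75, 0.65, 0.55, 0.90]

difference, agreement = compare_forms(form_a, form_b, cutoff=0.70)
print(f"mean difference = {difference:.3f}, decision agreement = {agreement:.2f}")
```

A small mean difference and high agreement would suggest that the two groups of content specialists read the domain description in much the same way, which supports generalizing from either form to the domain.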



2.2    Development  of  Objectives 

The purpose of this section is not to debate the merits of using
behavioral objectives. This has been a hot topic of debate
for years (Popham, 1968; MacDonald & Wolfson, 1970; Allendoerfer, 1971;
Forbes, 1971; Gagné, 1972; Ebel, 1973; Duchastel & Merrill, 1973; Kneller,
1972). For our purposes here, little could be accomplished by a review
of the literature. We will offer three reasons why we feel behavioral objectives
(or some variants) should be used, and then continue to the
specifics that are pertinent to criterion-referenced test development.
Behavioral objectives: (1) serve as a mechanism for organizing a curriculum,
(2) provide information to students about what is expected of them,
and (3) provide a basis upon which to assess student performance. It is
the third reason that is critical in the development of criterion-referenced
tests. Behavioral objectives are a necessity as a starting point for
setting up a criterion-referenced testing program. However, as the discussion
that follows will show, behavioral objectives are not sufficient;
more specification is needed.

Hopefully, the following historical development will provide a context
in which to view the present state of usage of objectives in the
development of criterion-referenced tests.¹ For the past few years, it has been
a popular procedure for criterion-referenced test developers to write
their objectives in "behavioral terms." However, while behavioral
objectives have some desirable features (for example, they are relatively
easy to produce), they often lack sufficient clarity to permit a clear
determination of the domain of test items measuring the behaviors intended
to be defined by an objective. For example, for even a simple mathematics
objective such as adding two single-digit numbers, test developers would
need information about the use of vertical versus horizontal format, the
use of negative numbers, the number of test items to be placed on each
page, etc. If the proper domain of test items measuring an objective
is not clear, it is impossible to select a representative sample of test
items from that domain. Since it is desired to interpret an examinee's
test performance on the sample of test items measuring a particular objective
as an estimate of that examinee's level of mastery in the larger domain of
items measuring an objective, it is essential to have the domain of test
items specified clearly, and to choose a representative sample of test
items. When the domain of behaviors is not clearly spelled out, we have
what Popham (1974) calls a "cloud-referenced test."

¹ Popham (1978b) offers an excellent summary of the development and
specification of objectives.

A recent advancement that has proven more useful than the behavioral
objective is the "amplified objective" (Popham, 1975). According to
Millman (1974), "An amplified objective is an expanded statement of an
educational goal which provides boundary specifications regarding testing
situations, response alternatives and criterion of correctness." The
importance of these additional guidelines added to a behavioral objective
is that they help to define the relevant domain of test items. There is
still some ambiguity in the domain definitions but the situation is
considerably improved over using behavioral objectives. A sample behavioral
objective and amplified objective are presented in Figure 2.2.1.

To further alleviate the ambiguity mentioned above in the use of
amplified objectives, Popham (1975) presented a procedure for the
development of domain specifications. Domain specifications can be viewed as a
logical extension of amplified objectives, where there is also careful
clarification of the content specified by an objective besides a



Figure 2.2.1. An Illustrative IOX Amplified Objective for a
Third-Grade Level Reading Comprehension Skill¹

DETERMINING  SEQUENCE  FROM  TENSE  AND  WORDS  THAT  SIGNAL  ORDER 

Objective:      The  student  will  correctly  identify  the  sequence  of  three  sentences 
by  determining  order  from  tense  and  words  that  signify  order. 

Sample Item:

Directions. Read the three sentences. Then mark an "X" next to the
answer that arranges the sentences in the proper order.

Example: A. Once there were only candles for lighting the home.
         B. Later, there were dim electric lights.
         C. Tesla thought of a way to make the electric lights brighter.

         a) A,C,B    b) A,B,C    c) C,A,B


Amplified  Objective: 
Testing  Situation. 

1.  The  student  will  be  given  three  sentences  and  will  identify  their 
proper  sequence  on  the  basis  of  verb  tenses  and  signal  words. 

2.  Three  sentences  containing  signal  words,  and/or  changes  in  verb 
tense  will  be  provided. 

3.  Vocabulary  will  be  familiar  to  the  third  grader.

Response  Alternatives.

1.  Three  possible  orderings  of  the  sentences  will  be  given.

2.  At  least  one  distractor  should  not  consist  of  a  random  ordering. 
It  should  maintain  the  first  event  as  first,  varying  only  the 
second  and  third  events. 

3.  The  other  distractor  may  be  any  other  incorrect  ordering  of  the 
events. 

Criterion  of  Correctness.  The  correct  answer  will  be  the  order  which  can 
be  determined  on  the  basis  of  one  of  the  following: 

1.  words  that  signify  sequence,  e.g.,  afterwards,  finally,  then, 
before,  during,  now,  next,  lastly,  later,  earlier,  meanwhile, 
long  ago,  once; 

2.  verb  tense  (future,  past,  present). 

¹ Reproduced from Popham (1978b, permission pending).


specification of boundary conditions, response alternatives and criteria
for  correctness.    In  section  2.3,  the  development  of  domain  specifications 
will  be  discussed  in  greater  detail. 
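
The "Response Alternatives" rules in Figure 2.2.1 are specific enough to be sketched in code. The Python fragment below is only an illustration of how an item writer might apply them; the function name `build_response_alternatives` and the use of letter labels for sentences are our assumptions and are not part of the IOX materials.

```python
import random
from itertools import permutations


def build_response_alternatives(correct_order, rng=None):
    """Return three orderings: the correct one plus two distractors.

    Illustrative only. `correct_order` holds three sentence labels in
    their true chronological order, e.g. ("A", "B", "C").
    """
    rng = rng or random.Random()
    key = tuple(correct_order)
    first, second, third = key

    # Rule 2: one distractor keeps the first event first and varies
    # only the second and third events.
    structured_distractor = (first, third, second)

    # Rule 3: the other distractor may be any other incorrect ordering.
    remaining = [
        p for p in permutations(key)
        if p != key and p != structured_distractor
    ]
    free_distractor = rng.choice(remaining)

    options = [key, structured_distractor, free_distractor]
    rng.shuffle(options)  # avoid signalling the key by its position
    return options
```

For the sample item in the figure, `build_response_alternatives(("A", "B", "C"))` always includes A,B,C and A,C,B; the third option is drawn from the remaining incorrect orderings.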

Besides applying the procedures for developing domain specifications
to amplified objectives, similar concerns may be addressed
with item forms analysis, or the development of item generation forms (Hively
et al., 1973). In this situation, two requirements need to be met:

1.  All the items which could be written from the content domain
to be tested must be written (or known) in advance of the
final selection process.

2.  A  random  or  stratified  random  sampling  procedure  must  be  used 
in  the  item  selection  process. 

Item forms analysis assures that the above two requirements are met. An
item form is actually a process having the following characteristics:

1.  It  generates  items  with  a  fixed  syntactical  structure. 

2.  It  contains  one  or  more  variable  elements. 

3.  It defines a class of item sentences by specifying the
replacement sets for the variable elements.
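
The three characteristics of an item form, together with the two sampling requirements listed earlier, can be illustrated with a short Python sketch. It is a minimal example under our own assumptions: the template string, the replacement sets (single-digit addends), and the function names are hypothetical and are not taken from Hively et al. (1973). Only simple random sampling is shown; a stratified version would sample within subsets of the enumerated domain.

```python
import random
from itertools import product

# Hypothetical item form: a fixed syntactical structure with two
# variable elements, each with its own replacement set.
TEMPLATE = "{a} + {b} = ?"
REPLACEMENT_SETS = {
    "a": range(0, 10),  # assumed replacement set: single-digit addends
    "b": range(0, 10),
}


def generate_domain(template, replacement_sets):
    """Enumerate every item the item form can produce."""
    names = list(replacement_sets)
    items = []
    for values in product(*(replacement_sets[n] for n in names)):
        items.append(template.format(**dict(zip(names, values))))
    return items


def sample_items(domain, n_items, seed=None):
    """Draw a simple random sample of items, without replacement."""
    return random.Random(seed).sample(domain, n_items)


# Requirement 1: all items are known in advance of selection.
domain = generate_domain(TEMPLATE, REPLACEMENT_SETS)
# Requirement 2: items are selected at random from that domain.
test_form = sample_items(domain, n_items=10, seed=1)
```

Because every item in the domain is enumerated before selection, a score on `test_form` can be read as an estimate of performance on the whole domain.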

Before discussing domain specifications and item forms in greater
detail, two comments can be made. One, the reader should link the above
discussion of domain specifications and item forms back to the introduction,
and the distinction set up between strong and weak domain sampling validity.
Two, as pointed out by Hively et al. (1973), the use of item forms analysis
is relevant only for highly structured subject areas such as mathematics.
Popham's domain specification procedure appears to be relevant for a much
wider variety of subject areas, and hence, in the sections to follow, more
emphasis will be placed on domain specifications.

It should be noted that there appear to be several other promising
ways for defining a "behavior domain." Besides "domain specifications"
and "item forms analysis," facet theory (Berk, 1978), item transformations
(Anderson, 1972; Bormuth, 1970) and algorithms (Scandura, 1977) have been
suggested. However, these last three methods will not be considered
further here. The cited references above will guide the interested reader
to further study of these important new developments.



2.3    Domain  Specifications 

Popham  (1975,  1978a)  has  prepared  a  series  of  steps  that  allow  the 
test  developer  to  produce  a  domain  specification.    According  to  Popham 
(1978a): 

The most important attribute of a criterion-referenced
test is that it provides a clear description of the
class of behavior that the examinee can or cannot
perform. In fact, this description of measured behavior
constitutes the "criterion" to which the test is
"referenced."

The steps Popham has developed help in describing the class of behaviors
referred to in the quotation above. According to Popham, a domain specification
should have two desirable qualities when prepared:

1.  The specification should be brief enough to be used
by the developer, and, at the same time,

2.  the specification serving as the domain description
should be stated so that it "sufficiently
circumscribes the class of behaviors under consideration so
that independent judges will register high agreement
regarding whether particular test items do, in fact,
measure the behavior described in that domain."

The following series of steps are those suggested by Popham (1975,
1978a). We have combined suggestions he has made in both of his
books to arrive at an "all-encompassing" set of steps one might follow
to produce a domain specification. Before presenting an outline of
these steps and a further discussion of the relevance of each, two comments
should be made. One, the domain specifications developed through
utilization of Popham's procedure lead to a situation that Traub has described
as having weak domain sampling validity. Only when the stimulus attributes
(step 3c in Popham's procedure) can be described in totality can strong
domain sampling validity be obtained. Two, in what follows, steps
one and two are concerned with general considerations one must attend to
before the formal domain specification, described in step 3, can be
prepared. With this in mind, an outline of Popham's (1975, 1978a)
steps for the preparation of a domain specification is as follows:

1.  Zeroing in on the behavior to be measured: Degree of generality
    a.  instructional duration
    b.  limited priorities
    c.  item homogeneity

2.  Selecting from competing domain alternatives
    a.  transferability within domain alternatives
    b.  transferability outside the domain

3.  Component steps in actual domain specification preparation
    a.  general description
    b.  sample item
    c.  stimulus attributes
    d.  response attributes
    e.  specification supplement

1.    Zeroing  in  on  the  behavior  to  be  measured:    Degree  of  generality 

The question that must be first answered in the ultimate quest for a
clear domain specification is, according to Popham (1975), "How large a
chunk of an individual's behavior should we set out to circumscribe?" There
is a trade-off here between choosing a large area of behavior, thereby
forcing minute detail in order to adequately specify the domain, and
smaller areas of behavior which collectively may bring about more domain
descriptions than people will ever use. Popham (1978a) argues for choosing smaller
segments of examinee behavior because the more finite behaviors are easier
to isolate and circumscribe. He offers three possible ways to aid in
the choice of degree of generality: instructional duration, limited
priorities, and item homogeneity. These are offered as guidelines for choice,
and can be used separately or collectively. In reference to instructional
duration, Popham (1975) states:

One  way  of  thinking  about  a  domain  is  to  consider 
the  amount  of  instructional  time  it  would 
typically  take  to  get  learners  to  display  the 
behavior  depicted  in  the  domain  description. 

In other words, the size of the domain to be circumscribed can be dictated
by how long it takes to instruct students in the domain. In reference to limited
priorities (the second guideline), one can get a "handle" on domain magnitude by
setting a limit on the number of domains to be used, and then making sure that the
most important behaviors to be sought are incorporated in this group of
limited domains. Finally, the domain generality issue can be resolved by
setting as a limit the domain description that would yield one variety of
homogeneous item. That is, the domain to be circumscribed can be only
large enough that the items generated from the domain perform the same
function. Sameness of function can be observed by looking at the similarity
between items in content and format.

Suppose that by using one of the suggested guidelines, or some
other practical method, the test developer has come up with what he/she
feels is a suitably sized segment of behavior to be measured, a segment
that can be adequately "circumscribed." In other words, the test developer
has come to closure on the degree of generality of the behavior he/she is
going to try to measure. The next problem is deciding on which of a number
of measurement approaches to use to try to measure the domain.



2.  Selecting  from  competing    domain  measurement  alternatives 

The problem to be addressed at this point is to decide upon one
from a number of measurement approaches possible to assess the behavior
under consideration. For example, if one is trying to assess a student's
ability to add, there are numerous measurement possibilities: numbers
listed vertically (4 above +2), numbers listed horizontally (4 + 2), numbers in
equation form (4 + 2 = x), or possibly in verbal form (Joe has 4 oranges,
Anne gives him 2 more, how many does he have?). There are more possibilities
for this very simple task; for complex tasks, the possibilities that can be
written down are even less exhaustive. Popham (1975) warns that while it is
enticing to combine all (or many) of the possibilities in a single domain,
this would cause confusion as to what constituted the domain in the first
place. One measurement procedure must be chosen and Popham offers that the
measurement alternative chosen should be the most generalizable of the
possibilities considered and also be the one most able to be transferred
outside the particular domain to others. In reference to generalizability across
alternatives, another way of looking at this would be to select the alternative
that, when mastered, would be most likely to reflect mastery of the other
possibilities. In reference to degree of transfer, the alternative that
transfers the most to other skills, courses, etc., should be chosen. Once
again, these are only potential guidelines to be used; there are no
hard-and-fast procedures for selecting the measurement alternatives.
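
To make the competing alternatives concrete, the Python sketch below renders the same addition task in the four formats mentioned above. It is only an illustration: the function name `render_addition` and the format labels are our assumptions. A domain specification that follows Popham's advice would fix a single format rather than mix them.

```python
def render_addition(a, b, fmt):
    """Render one addition task in a chosen measurement format (illustrative)."""
    if fmt == "vertical":
        plus_line = f"+ {b}"
        width = max(len(str(a)), len(plus_line))
        # Right-align the addends and draw a rule beneath them.
        return "\n".join(
            [str(a).rjust(width), plus_line.rjust(width), "-" * width]
        )
    if fmt == "horizontal":
        return f"{a} + {b}"
    if fmt == "equation":
        return f"{a} + {b} = x"
    if fmt == "verbal":
        return (
            f"Joe has {a} oranges. Anne gives him {b} more. "
            "How many does he have?"
        )
    raise ValueError(f"unknown format: {fmt}")
```

Calling `render_addition(4, 2, fmt)` with each label in turn shows how different the stimulus looks to an examinee even though the underlying arithmetic is identical.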

In  sum,  by  selecting  from  the  myriad  of  possible  ways  of  measuring 
behavior  the  one  form  of  measured  behavior  that  is  the  most  generalizable,  we 
will  have  (Popham,  1975): 

...the  best  of  two  worlds,  that  is,  an  adequate 
reflection  of  the  attribute  being  assessed  plus 
an  understandable  set  of  test  results. 



Thus far, the test developer has decided upon the degree of
generality of the domain he/she wants to deal with, and has further chosen
a particular assessment approach from a larger group. The next step is
to actually generate the domain specifications.

3.    Component  steps  in  domain  specification  preparation

Popham (1975) makes the point here that differing approaches to
describing a relevant domain vary in degree of detail. He offers
a set of procedures that have been modified through use with the Instructional
Objectives Exchange. Popham recognizes that they are far less
"elaborate" than the procedures developed by Hively et al. (1973) for
describing a domain of behavior. The utility of Popham's approach, we
feel, comes from being able to use his set of steps over a wide
variety of subject domains, especially those that are less structured
(e.g., the humanities).

The first component of a domain specification to be prepared
is a general description of exactly what the test purports to measure.
This provides an overview of the behavior (or set of behaviors) that is
described in detail later. While this component could be suitably called an
objective, Popham prefers to call it a general description. In his 1978 book, he offers
the following example of a general description for a CRT dealing with the scientific
method.

When  given  brief,  previously  unseen  fictitious 
accounts  of  the  research  activities  of  natural 
and  physical  scientists,  students  will  answer 
questions  (keyed  to  the  accounts)  calling  for 
the  identification  of  particular  phases  of  the 
scientific  method  being  illustrated. 

As a note for the test developer, as he/she proceeds through the
later steps, specification of stimulus and response attributes, this
general description may have to be reworked several times.


The second component is the specification of a sample test item,
including the directions to the student about how to respond. Popham (1978a) feels
there are two reasons for providing a sample item as the second component:
It serves as an illustration for individuals unable to read the detailed
descriptions (because of time involved) and, more importantly, it provides format cues
for the item writers. It specifies the preferred form in which the items can be
constructed. Popham (1978a) also suggests that the correct answer not be
identified for the sample item; the complete specification should be
considered before the reader can adequately assess degree of correctness of
the sample item.

The third step in the development of a domain specification is by far
the most difficult; here the attributes of the stimulus materials are
specified, along with delimitations on possible stimuli. In other words,
there must be an extensive, i.e., as complete as possible, description of what
stimuli can constitute a test item. According to Popham (1978a):

In  the  stimulus  attributes  section  of  the  test 
specifications  we  must  set  down  all  the  really 
influential  factors  that  constrain  the  composition 
of  a  set  of  test  items. 

Further,  Popham  (1978a)  notes: 

The  general  rule  is  that  the  test  specifier  has  to 
spell  out  all  of  the  critical  and  controlling 
dimensions  which  will  permit  someone  to  create  a 
set  of  test  items  that  will,  without  exception, 
be  viewed  as  congruent  with  the  constraints  set 
forth  in  the  specifications. 

For tests that depend on subject matter content, which most criterion-referenced
tests do (although one could envision CRTs in the affective
domain), Popham suggests that one of three techniques be utilized in
specifying content that can be used for stimuli (in the test item). One
technique is to spell out rules or algorithms which are used in generating
and delimiting the content. Item generation rules or item forms analysis
(Hively et al., 1973) is an example of such an approach. As mentioned
before, we see this as being an "idealized" situation likely to occur for
only structured subject areas. However, if item forms analysis is possible,
it is at this point that Popham's work and the work of Hively and his
associates unite. A second technique for delimiting and specifying content
is to list all the content that might be included. This would seem to
hold relevance for even fewer situations than the first technique. The
final technique, which holds the most relevance for situations the test
practitioner is likely to encounter, is to try as carefully as possible to
isolate and describe the defining attributes of all eligible content for
the test. Popham says that if even this isn't possible, some examples of
acceptable and unacceptable content are better than nothing.

In considering what does and what does not constitute
relevant stimulus attributes, Popham suggests always using the following
reminder:

What are the absolutely indispensable elements that
item writers must consider in producing test items?

Further, in deciding about what rules for inclusion or exclusion of stimuli
need to be specified, preparation of some trial items should be helpful in
making decisions.

In sum, the specification of stimulus attributes constitutes the
most critical step in the development of a domain specification. The rules
for generating the items for the domain, and ultimately the test itself, must
be defined here. Therefore, the stimulus attributes section should both specify
the content on which the items are generated and also describe the "directions
to respond" that the student is to receive.



The fourth step, specification of the response attributes, is a
little easier than specification of the stimulus attributes. This is
because only two possible types of responses can be made by the examinee.
He/she can either select from a set of response options for a test question
or he/she can construct a response. This section should specify the
rules/criteria by which both sorts of response (if both are used), or the
single sort used, are to be treated.

If an examinee has to select a response, rules must be provided
for determining the nature of the correct and the incorrect responses. In other
words, when the specification is given to an item writer, the writer should be
able to generate the correct response and the incorrect response(s) directly
from the response attribute section. Popham (1978a) suggests that identification
of wrong answer options for a test item can usually come about by considering the
various ways in which the examinee "goes wrong."

If the respondent is asked to construct his/her own response, the
task of specifying response attributes becomes even more difficult. The
test developer must try to explicate the criteria that should be used to
judge how adequate an examinee's response is. Popham again suggests that
creation of some sample trial responses should help in this process. Finally,
in reference to constructed responses, Popham warns against the use of "hedging
phrases." He uses examples such as "responses must be appropriate to the
context of the stimulus" or "answers should be reasonable outgrowths of the
materials provided." Without further defining "appropriate" and "reasonable,"
such explications of response attributes have accomplished nothing. The
explication must be specific, so that, based upon the response attributes,
one can ascertain the "fit" of the student's response to the specifications.
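
As a rough aid for reviewers, a draft response-attributes section can be screened for hedging phrases before it goes to item writers. The Python sketch below is only an illustration: the term list is hypothetical (the first two terms come from the examples quoted above, the others are our own additions), and a flagged term merely signals that a definition may be missing.

```python
import re

# Hypothetical list of vague qualifiers. "appropriate" and "reasonable"
# come from the examples quoted in the text; the rest are assumptions.
HEDGING_TERMS = ["appropriate", "reasonable", "adequate", "suitable"]


def find_hedging_terms(response_attributes, terms=HEDGING_TERMS):
    """Return the vague terms that appear in a response-attributes text."""
    found = []
    for term in terms:
        pattern = rf"\b{re.escape(term)}\b"
        if re.search(pattern, response_attributes, re.IGNORECASE):
            found.append(term)
    return found
```

A writer who sees a non-empty result would then replace each flagged term with explicit scoring criteria.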

The specification supplement is just that: a supplement that might
contain information on the stimulus attributes and/or response attributes
section(s) that would have made the respective sections too long. In other
words, contained here would be, for instance, content listings that should
be listed some place, but aren't critical for the stimulus attributes
section.
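
The five components just described can be held in a simple record so that drafts are easy to check for completeness. The Python sketch below is one possible arrangement under our own assumptions; the class name `DomainSpecification`, its field names, and the helper method are hypothetical and are not prescribed by Popham.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DomainSpecification:
    """Container for the five components of a domain specification."""
    general_description: str
    sample_item: str  # includes directions; the key is left unmarked
    stimulus_attributes: List[str]
    response_attributes: List[str]
    # The supplement is optional and holds overflow material.
    specification_supplement: List[str] = field(default_factory=list)

    def missing_components(self):
        """Name any required component that is still empty."""
        required = {
            "general_description": self.general_description,
            "sample_item": self.sample_item,
            "stimulus_attributes": self.stimulus_attributes,
            "response_attributes": self.response_attributes,
        }
        return [name for name, value in required.items() if not value]
```

A draft with an empty sample item and no response attributes would report both as missing, which gives a work group a quick checklist before the draft goes out for review.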

Following through this series of steps should help the test developer
come up with a quite well-developed set of domain specifications. Unlike
the situation when alternate approaches such as item forms analysis are
used, one can never be certain that all sources of ambiguity have been
removed from a domain specification. However, it should be quite clear
to the reader at this point that the procedure just described is relevant
for a wide variety of subject domains, and from a practical point of view,
one should be able to live with the ambiguity because of the procedure's
great flexibility.

No matter how much care is taken in preparing domain specifications,
they must be carefully reviewed by other content specialists and individuals
who will use them (for example, item writers and teachers). A draft copy
of a domain specification review form follows this discussion. It is
assumed in the review form that a domain specification is divided into
four sections: (1) Skill, (2) Sample Directions and Test Items, (3)
Content Domain, and (4) Characteristics of Answer Choices and Scoring. (Clearly,
the four sections correspond to those proposed by Popham but we have used
new section labels to facilitate communication with domain specification
writers.)

Our usual procedure is to separate domain specification writers into
work groups of three or four. Their task is to produce draft copies of
domain specifications. (The overall content of each domain specification
is specified in a "test blueprint" or "content guide" which must be
prepared and approved before the writing of any domain specifications
can begin.) The draft copies are then critiqued by at least one other
work group (and more if there is time and money to do so) using the
review form to guide the direction of the critique. Once a domain
specification is reviewed, the writing group and the review group(s) meet
to discuss the review group's critique. Following this meeting,
appropriate revisions can be made. It is usually desirable to have the
revised domain specifications reviewed again, sometimes by a larger
and more diverse group of individuals.




January  12,  1979 
(Third  Draft) 


Domain  Specification 
Review  Form 


Domain  Specification: 


Reviewer : 


Date: 


Please read the domain specification carefully. Next answer the eight questions
below. Staple (or clip) your copy of the domain specification to this review
form after you have answered the questions.

Thank  you  for  your  comments. 


SKILL 

1.  Does the "skill section" provide sufficient details to give a reader an
indication of the behaviors defined by the domain specifications? (Circle one)

(a)  Yes

(b)  Yes, with reservations
Please explain:


(c)  No 

2.  How  might  the  "skill  section"  be  revised  to  improve  its  clarity? 


SAMPLE  DIRECTIONS  AND  TEST  ITEM

3.  Can  the  test  directions  be  revised  to  improve  their  clarity?  (Circle  One) 

(a)  Yes,   (please  write  your  comments  on  the  domain  specification) 

(b)  No,  they  are  clearly  written 




4.  Do  you  feel  the  item  format  is  the  best  to  ensure  that  examinee  answers 
will  measure  the  behaviors  defined  by  the  general  description?  (Circle  One) 

(a)  Yes 

(b)  Yes,  with  reservations 
Please  explain: 


(c)  No.  Which  item  type  would  be  best? 


5.  Will  the  sample  test  item  provide  a  "good  model"  for  item  writers  pre- 
paring items  to  measure  the  domain  specification?      (For  example,  does 
the  item  include  the  desired  number  of  answer  choices,  and  is  the 
vocabulary  appropriate  for  the  intended  group  of  examinees?)  (Circle  One) 


(a)  Yes 

(b)  Yes,  with  reservations 
Please  explain: 


(c)  No 


CONTENT  DOMAIN 

6.  Do you have any suggestions for revising and/or extending the content
defined by the domain specification? Please write comments on your
copy of the domain specification. (Your suggestions could include:
(a) deletions of specific content, (b) additions to the content, (c)
rewrites for clarification.)


CHARACTERISTICS  OF  ANSWER  CHOICES  AND  SCORING 

7.  Do you have any suggestions for revising and/or extending the
characteristics of possible answers? Please write comments on
your copy of the domain specification.


8.   (For  essay  questions  only.)     Do  you  have  any  suggestions  for  improving 
the  scoring  of  test  items?    Please  write  comments  on  your  copy  of  the 
domain  specification. 



2.4    Examples  of  Domain  Specifications

The examples of domain specifications offered in this section come
from many sources. The first two examples are from Popham (1978a)
and are included here with his permission. They are direct applications
of the steps discussed in section 2.3. The third example was prepared
by Millman and Craig (1977) and reproduced here with the senior author's
permission. Although the domain specification is organized in a fashion
different from the plan offered in the last section, it is an excellent
example of a domain specification. The next set of examples were prepared
by Jerry George and his staff at the Glendale Union High School District
in Arizona. They are included in our materials with their permission.
These sample domain specifications are only a few of more than a hundred
they have prepared in the last two years in four subject areas. Again,
the format of the domain specifications is different from Popham's most
recent recommendations advanced in section 2.3, but clearly the chosen
format represents an attempt by the authors to clarify the relevant domain
of behaviors defined by the objectives defining their high school curricula.
(They prefer the term "criterion-referenced test model" to "domain
specification." The first term was introduced by Popham a number of years ago.)
The remaining seven examples are second or third drafts of domain
specifications prepared at workshops offered by the authors of this Practitioner's
Guidebook. The domain specifications are from several content areas and
presented in several different formats.




Example 1 (Popham, 1978, pp. 129-131)


An illustrative set of criterion-referenced test specifications:
applying concepts of United States foreign policy

General  description 

Given a description of a fictitious international situation in which the
United States may wish to act, and the name of an American foreign policy
document or pronouncement, the students will select from a list of
alternatives the course of action that would most likely follow from the given
document or pronouncement.

Sample  item 

Directions: Read each fictitious example below. Decide what action the
United States would most likely take based on the given foreign policy
document. Write the letter of the action on your answer sheet.

Some Russian agents have become members of the Christian Democratic
Party in Chile. The party attacked the president's house and
arrested him. The Russian agents set themselves up as president and
vice-president of Chile. Chile then asked to become an "affiliated
republic" of the USSR.

Based on the Monroe Doctrine, what will the United States do?

a.  Ignore  the  new  status  of  Chile. 

b. Warn Russia that its influence is to be withdrawn from Chile.

c.  Refuse  to  recognize  the  new  government  of  Chile  because  it  came 
to  power  illegally. 

d. Send arms to all groups in the country that swear to oppose
communism.

Stimulus  attributes 

1. The fictitious passage will consist of 500 words or less followed
by the name of a foreign policy pronouncement or document
inserted into the question, "Based on the ______, what will the
United States do?"

2. The policy named in the stimulus passage will be a document or
pronouncement selected from the specification supplement.

3. Each passage will consist of two parts: (a) a background description
of an action taken by a nation and (b) a statement of the action to
which the foreign document or pronouncement is to be applied.

a. The background statement will be analogous to an historical
situation that either preceded the issuance of the cited document
or pronouncement or for which the document or pronouncement
was used. For example, the Monroe Doctrine was drawn up in
response to European designs on American nations that were
attempting to establish independence. An analogous case today
might describe a European country attempting to encroach on the
sovereignty of an American country.

b. The statement of an action will describe an action taken by a
real foreign nation that conforms to one of the following
categories:

(1) Initiation of an international conflict.

(2) Initiation of a civil conflict. This may include coups,
revolutions, riots, protest marches, civil war, or a parliamentary
crisis.

(3) Initiation of an international relationship. This may include
trade negotiations, friendship pacts, military alliances,
and all classes of treaties.

(4) Appeal for foreign aid to meet economic or military
needs.

(5) Development and stockpiling of military weapons.

4. All statements in the passage will refer to specific nations and
events. Descriptions such as, "A nation is at war with another
country," are not acceptable.

5.  When  the  document  or  pronouncement  mentioned  in  the  stimulus 
passage  is  tied  to  a  particular  geographical  region,  countries  named 
in  the  passage  must  belong  to  that  region. 

6. Passages will be written at no higher than the seventh-grade
reading level.

Response  attributes 

1. Students will be asked to mark the letter of one of four given
response alternatives consisting of the correct response and three
distractors. Each alternative will possess the following
characteristics:

a. Describe a specific course of action that refers to the people,
nations, and actions in the stimulus passage.
b. Be brief phrases written to complete the understood subject,
"The United States will . . ."

2. Distractors (wrong answers) will be written to meet these
additional criteria:

a. Each distractor will describe an action derived from a different
document or pronouncement selected from the specification
supplement.

b. Documents or pronouncements from which identical courses of
action may be derived will not be used.

c. The decision for the United States not to act may be used as
a course of action when it is based on a document or
pronouncement.

3. The correct response will be the course of action that is warranted
by the principles described in the document or pronouncement
named in the stimulus passage.

Specification supplement: Foreign policy documents and pronouncements

The following list of foreign policy pronouncements and documents was
selected from Thomas Brockway, Basic Documents in United States
Foreign Policy (Princeton, N.J.: D. Van Nostrand, 1968). The items
selected were chosen on the basis of their generalizability and potential
application to current events. The list appears in chronological order.

1. The Declaration of Independence
2. Washington's Farewell Address
3. The Monroe Doctrine
4. Webster on Revolutions Abroad
5. Open Door in China
6. The Platt Amendment
7. Roosevelt Corollary of the Monroe Doctrine
8. The Fourteen Points
9. The Washington Conference
10. The Japanese Exclusion Act
11. The Kellogg-Briand Pact
12. The Stimson Doctrine
13. Roosevelt's Quarantine Speech
14. The Atlantic Charter
15. The Connally Resolution
16. The Yalta Agreements
17. The Potsdam Agreement
18. United States Proposals for the International Control of Atomic
Power
19. The Truman Doctrine
20. The Marshall Plan
21. The Point Four Program
22. The North Atlantic Treaty
23. American-Japanese Defense Pact
24. Atoms for Peace: Eisenhower's Proposal to the United Nations
25. The Formosa Resolution
26. The Eisenhower Doctrine
27. Alliance for Progress
28. Kennedy's Grand Design
29. Treaty on the Peaceful Uses of Outer Space




Example  2  (Popham,  1978,  pp.  132-134) 




An  illustrative  set  of  criterion-referenced  test  specifications: 
job  interview  procedures 

General description

Having read a description of a job interview in which the applicant may
make several specified types of errors in appearance, conduct, or
attitude, the student will select the error made or indicate that no error
was made.

Directions: Read the description of each job interview below. If the
applicant makes an error in interview behavior, mark the letter of the
response alternative that matches the error described. If no error was
made, mark "e."

Anita arrives five minutes early for an interview for a trainee job in
floral design and sales. She wears a white dress with long, full sleeves
and shoes with high heels. She brings a portfolio of her work as a
design major in high school and briefly points out the designs she
feels are most closely related to floristry. She answers the
interviewer's questions in a brief, courteous manner and indicates her
willingness to perform all aspects of the florists' trade, including
scrubbing floors, washing buckets, and disposing of spoiled flowers.
What is Anita's error?

a. lack of punctuality

b. inappropriate dress

c. irrelevant materials presented

d. inappropriate attitude

e. no error was made

Stimulus  attributes 

1. Each item will consist of a fictitious description of 100 words
or less dealing with a named person's job interview, followed
by that person's name inserted into the question, "What is
______'s error?"

2.  The  description  will  include  the  type  of  job  being  applied  for  and 
illustrations  of  at  least  four  of  the  following  behavioral  factors  that 
may  influence  an  impression  of  an  applicant: 

a. Punctuality - arrival at or within a reasonable time before the
specified interview time. Arrival after the specified time, or arrival
more than ½ hour early will be considered lack of punctuality, as
both may inconvenience the interviewer.




b. Appropriateness of dress - dress which is neat, clean, and
practical for the type of job being applied for. If one expects that
an interview may include a demonstration of skills, one's clothing
must not interfere with such a demonstration. Extremes such as
very high heels, low cut dresses, very tight pants, etc., are almost
always inappropriate. Appropriateness of dress also includes such
personal grooming items as length of fingernails, length and style
of hair, etc., which are inappropriate only if they are likely to
interfere with the work involved in the job being applied for
(e.g., long fingernails on a secretarial applicant).

c. General courtesy - pleasantness and politeness to all individuals
encountered before, during, and after the interview.

d. Frankness - honesty and directness in answer to personal or
experience-related questions. False answers, misleading answers,
attempts to change the subject, or attempts to rationalize answers
will be considered lack of frankness.

e. Careful thought to answers - brief, clear, well-thought-out
answers to problems posed by interviewer. Excessive wordiness,
self-contradiction, disorganized answers, and answers that do
nothing more than reiterate the problem will be considered
evidence of lack of careful thought to answers.

f. Appropriateness of attitude - interest and enthusiasm displayed
toward all aspects of job, but without pushiness or opinionatedness.
Interest and enthusiasm may be indicated by simply stating
their presence (e.g., "John appears very interested in the
techniques demonstrated") or by a direct or indirect quotation on the
part of the applicant expressing enthusiasm or interest (e.g., "Of
course I don't mind emptying buckets. I want to learn all about
the business."). Pushiness and opinionatedness may be indicated
by attempts to tell the interviewer how the business should be
run, boasting about superiority of knowledge or ability (as
opposed to offering to demonstrate ability), sarcastic comments,
attempts to bully interviewer, and similar actions. General lack
of enthusiasm (indicated by description or quotation), complaints
about specific aspects of the job, or the presence of any of the
indications of pushiness or opinionatedness will be considered
inappropriate attitude.

g. Relevance of materials presented - direct and obvious
relationship to job being applied for of any education- or
experience-related materials brought to interview. Examples of appropriate
materials are a typing award for a secretarial applicant, or a
portfolio of works from a high school design course for an applicant
in any art- or design-related field. Examples of inappropriate
materials are a tennis award for an engineering applicant, or a
record of offices held in high school for a janitorial applicant. The
relevance or irrelevance of such materials may be made more ob-




for college, etc.) and what working conditions, salary, and rate
of advancement they expect. Inability to answer specific
questions dealing with these issues (e.g., "What salary do you
expect?" "I don't know. What did you plan to pay?") or working
conditions, salary, or advancement expectations that are
exceptionally high or low for the job being applied for (e.g., plans to
be vice-president of company within two years of being hired as a
secretary, or asking only $2.50 per hour for work requiring a
graduate degree or highly specialized training), will be
considered lack of specific and realistic goals.

3.  The  interview  description  may  illustrate  completely  correct  be- 
havior, or  one  of  the  behavioral  factors  illustrated  may  exemplify 
erroneous  behavior,  whereas  the  rest  of  the  description  exemplifies 
correct  behavior.  No  more  than  20  percent  of  the  test  items  will  ex- 
emplify completely  correct  behavior. 

4.  The  description  may  include  direct  quotation  of  the  interviewer 
and/or  the  interviewee,  as  well  as  description  of  their  actions  and 
conversation. 

5. If several descriptions are used in a test, the names given to
interviewers will be evenly divided between male and female, and will
include some names characteristic of the most common ethnic
groups in the population to be tested. The name to be used with a
given job will be chosen at random so that discrimination cannot be
made on the basis of sex or ethnic group.

6.  The  readability  of  the  descriptions  will  be  no  higher  than  tenth- 
grade  level. 

Response  attributes 

1. The students will mark on their answer sheets the letter that
corresponds to the error made by the job applicant (if any) or the
statement that "no error was made."

2. There will be five alternatives, consisting of the correct response
and four distractors. The options will include the response "no error
was made" along with four of the following behavior factors: lack
of punctuality, inappropriate dress, lack of general courtesy, lack of
frankness, lack of careful thought to answers, inappropriate
attitude, irrelevant materials presented, and lack of specific and realistic
goals. The four behavioral factors chosen will correspond to four of
the factors illustrated in the interview description and will include
that factor (if any) in which an error is illustrated.

3. The correct response will be that alternative that correctly names
the error illustrated in the interview description,
or, in the event that no error was illustrated, that alternative that
states "no error was made."
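
Response attribute 2 amounts to a small assembly rule: four of the factors illustrated in the description (including the erroneous one, if any) plus the fixed option "no error was made." The following is a minimal sketch of that rule, not part of Popham's specification. It assumes the item writer has already tagged which factors a description illustrates; the names `build_options` and `share_without_error`, and the choice to list "no error was made" last, are illustrative assumptions.

```python
import random

FACTORS = [
    "lack of punctuality",
    "inappropriate dress",
    "lack of general courtesy",
    "lack of frankness",
    "lack of careful thought to answers",
    "inappropriate attitude",
    "irrelevant materials presented",
    "lack of specific and realistic goals",
]
NO_ERROR = "no error was made"


def build_options(illustrated, error=None, rng=None):
    """Return (options, key) for one item.

    illustrated: factors illustrated in the description (at least four).
    error: the single factor illustrated incorrectly, or None.
    """
    rng = rng or random.Random()
    if len(illustrated) < 4 or any(f not in FACTORS for f in illustrated):
        raise ValueError("need at least four known factors")
    if error is not None and error not in illustrated:
        raise ValueError("the error must be one of the illustrated factors")
    # Draw the remaining factor options from the illustrated factors only.
    pool = [f for f in illustrated if f != error]
    chosen = rng.sample(pool, 4 if error is None else 3)
    if error is not None:
        chosen.append(error)
    rng.shuffle(chosen)
    options = chosen + [NO_ERROR]  # listed last here; an assumption of this sketch
    key = error if error is not None else NO_ERROR
    return options, key


def share_without_error(keys):
    """Fraction of items keyed "no error was made" (stimulus attribute 3 caps it at 20 percent)."""
    return sum(k == NO_ERROR for k in keys) / len(keys)
```

A test assembler could call `share_without_error` on the keys of a draft form and reject the form when the result exceeds 0.20.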




EXAMPLE OF A
DOMAIN SPECIFICATION¹


UNIT  PRICING 
Objective: 

Identify  the  package  having  the  lowest  unit  price,  given  different 
sizes  of  the  same  brand  product  and  their  cost.    (Similar  to  perform- 
ance indicator  4EL.) 

Rationale: 

Retail items are an important component of every consumer's budget,
and an understanding of unit pricing is essential to economic buying
habits.


¹From Millman, J., & Craig, M. M. Rhode Island's educational performance
indicators and items: An independent evaluation and feasibility report. Final
Report. June 1978. (Reproduced with permission from the senior author.)



Sample  Items: 


1.  The unit price labels for three packages of
Boundless paper towels, each of a different
size, are shown below.


Product                    UNIT PRICE          RETAIL PRICE   Contents
Boundless Towels, 2 ply    7.7¢ per sq. yd.    86¢            100 sq. ft., 50 sheets
Boundless Towels, 2 ply    7.1¢ per sq. yd.    $1.19          150 sq. ft., 75 sheets
Boundless Towels, 2 ply    8.1¢ per sq. yd.    $2.25          250 sq. ft., 125 sheets

Which package is the most economical?

A. 50 sheet size
B. 75 sheet size
C. 125 sheet size
D. More information is needed to answer the question.




2.   Two  packages  of  Zip  detergent  are  shown  below. 


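[Figure: two packages of Zip detergent, labeled "Family size" and
"Economy size." Net weights shown: 3 lb. 2 oz. and 5 lb. 4 oz.
Prices shown: $1.48 and $2.45.]

Which package has the lower unit price?

A.  Family size

B.  Economy size

C.  They have the same unit price.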


Questions 


The test items will be about familiar objects found in a grocery
or similar retail store.

Within  a  given  item,  all  packages  will  be  of  the  same  brand  and 
product  (and  thus  may  be  assumed  to  have  the  same  ingredients). 

Relevant  information  on  unit  and  price  will  be  presented  in  one 
of  three  forms: 

a)  unit  pricing  labels, 

b)  pictures  or  drawings  of  product  packages,  or 

c)  written  text  on  unit  and  price. 

Information  on  "coupons  inside  package"  or  "cents  off  the  marked 
price"  will  not  be  included. 




In  each  question,  two  to  four  packages  will  be  identified. 

The question will ask which package either has:

a)  "the lowest unit price" or is

b)  "the most economical."

Note:  "Best buy" language will not be used since considerations
other than lowest unit cost or economy enter into the
determination of best buy.

Prices will be shown using the $ symbol for amounts of $1.00 or
more; the ¢ symbol will be used otherwise.

Across  items,  there  will  be  no  relation  between  the  size  of  the 
package  and  its  unit  price. 


Options 


Responses  will  be  presented  in  multiple-choice  format. 

There will be only one correct answer per question.

When only two packages are shown, "They have the same unit
price" will be used as one of the response options.

The option, "More information is needed to answer the question,"
can be used.


Units  of  Quantity 


The units may be of number, weight, length, area, or volume.

More than one unit may be shown on any one package.  A roll of
toilet tissue, for example, may indicate both the number of
sheets and the area.

If the units are not comparable across products, the student is
to answer that more information is needed to answer the
question.  A brand of shampoo sold in both liquid and paste form,
for example, may be expressed in noncomparable units.




Mathematics  Involved 


If conversions are required to make the units comparable, then the
conversion factors will be provided as part of the problem.

Prices will be chosen so that the most economical item will be at
least a full penny less expensive than other items, whether or not
students round off their calculations.

Students will not have to compute areas.
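
The unit-price comparison behind sample item 1 can be sketched as follows. This is an illustration under stated assumptions rather than part of the Millman and Craig specification: the function name `unit_price_cents_per_sq_yd` is hypothetical, the retail prices and areas are read from the three labels in the item, and the only outside fact used is that one square yard equals nine square feet.

```python
SQ_FT_PER_SQ_YD = 9  # 1 sq. yd. = 9 sq. ft.


def unit_price_cents_per_sq_yd(retail_price_cents, area_sq_ft):
    """Unit price in cents per square yard."""
    return retail_price_cents / (area_sq_ft / SQ_FT_PER_SQ_YD)


# Retail price (cents) and area (sq. ft.) from the three labels in sample item 1.
packages = {
    "50 sheet size": (86, 100),
    "75 sheet size": (119, 150),
    "125 sheet size": (225, 250),
}

unit_prices = {
    name: unit_price_cents_per_sq_yd(price, area)
    for name, (price, area) in packages.items()
}
# The most economical package is the one with the lowest unit price.
most_economical = min(unit_prices, key=unit_prices.get)

for name, price in unit_prices.items():
    print(f"{name}: {price:.1f} cents per sq. yd.")
print("Most economical:", most_economical)
```

Rounded to one decimal place, the computed values agree with the unit prices printed on the labels, so a student answering this item only has to compare the labels; the calculation matters for items that present retail price and quantity without a unit price.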



Example  4  (Glendale  Union  High  School  District) 


PROGRAM    Language Arts        SUBDIVISION        I.A.1.a. - g.
COURSE     English 1-2          SKILL/CONCEPT      Parts of Speech
TEST       Pre/Post Test        BEHAVIORAL LEVEL   Application


RATIONALE 


Parts of speech are taught, not as ends in themselves, but as tools for the
improvement of oral and written communications.  Instruction should help
students see how parts of speech work together to form meaningful language
structures.


CONTENT  LIMITS 

1.  A noun is the name of a person, place, thing, or idea (actor, city,
automobile, kindness).

2.  A pronoun is a word which can be substituted for a noun.

3.  Personal pronouns may refer to the person speaking, the person spoken to,
or the person spoken of (I-me, you, he-him, she-her, we-us, they-them).

4.  Indefinite pronouns do not refer to any specific persons or things
(everybody, everyone, somebody, someone, nobody, no one, anybody, anyone).

5.  A verb is a word that shows action or state of being (rush, bite, is, are).

6.  A verb which is made up of more than one word is called a verb phrase
(is leaving, was helping).

7.  An adjective is a word that describes, limits, or modifies a noun or
pronoun (a walnut desk, a cloudy day, he is remarkable).

8.  An adverb is a word that describes, limits, or modifies a verb, adjective,
or another adverb (came quickly, an especially fine paper, played fairly well).

9.  A preposition is a word which shows a relationship between its object and
some other word in a sentence. (We flew above the clouds.  They lived around
the corner.  She is at home.)

10.  A conjunction is a word which joins words or groups of words (and, but, or,
nor, for, yet).

11.  Correlative conjunctions are used in pairs with other words dividing them
(both-and, either-or, neither-nor, not only-but also, whether-or).


ITEM  FORMAT 

Format  is  one  simple  or  compound  sentence  with  five  words  lettered  and  underlined, 
followed  by  two  or  three  questions  asking  for  identification  of  underlined  words. 




Item  Restrictions 

1.  The following will not be tested or used as distractors:
.  verbals
.  a, an, and the
.  prepositions ending in -ing (during, concerning, etc.)
.  phrasal prepositions (according to)
.  adverbs which also function as adjectives or prepositions (deep, up, etc.)
.  pronouns other than the examples shown in the content limits

2.  Sentences developed for testing parts of speech will not contain
infinitive phrases.


RESPONSE  DESCRIPTION 

The  student  will  demonstrate  application  of  rules  for  distinguishing  parts  of 
speech  by  identifying  parts  of  speech  in  simple  and  compound  sentences. 


DIRECTIONS  TO  THE  STUDENT 

Parts  of  Speech.  Read  the  sentence  and  answer  each  of  the  questions  below  it 
by  blackening  the  lettered  space  on  your  answer  sheet. 


CRITERIA 


The  student  will  correctly  identify  the  underlined  part  of  speech. 


ITEM 


A  B  C  D  E 

The  suspect  led  police  on  a  wild,  high-speed  chase  through  the  city. 


1.    Which of the underlined words is a noun?  (D)


2.    Which  one  is  a  preposition?  (E) 




Example 5 (Glendale Union High School District)


PROGRAM    Language Arts        SUBDIVISION        I.B.1.a.
COURSE     English 1-2          SKILL/CONCEPT      End Marks (MODEL A)
TEST       Pre/Post Test        BEHAVIORAL LEVEL   Application


RATIONALE 

Question marks and periods at the ends of sentences are taught to help
students achieve clarity in written communications.


CONTENT  LIMITS 

1.  A  period  is  used  at  the  end  of  a  declarative  sentence. 
(I  spoke  to  them.) 

2.  A  question  mark  is  used  at  the  end  of  an  interrogative  sentence. 
(Did you speak to them?)


ITEM  FORMAT 

Format  is  a  set  of  four  sentences,  one  of  which  contains  an  error  in  the  use 
of  the  period  or  question  mark.    Each  set  of  four  sentences  will  contain 

1.  One  or  more  direct  statements 

2.  One  or  more  direct  questions 

3.  One  indirect  or  false  question 


RESPONSE  DESCRIPTION 

The  student  will  demonstrate  application  of  the  rules  for  using  periods  and 
question  marks  at  the  ends  of  sentences  by  identifying  errors  in  the  use  of 
end  mark  punctuation. 


CRITERIA 

The  student  will  identify  the  sentence  which  is  improperly  punctuated. 



DIRECTIONS  TO  THE  STUDENT 

Using  Periods  and  Question  Marks.    Find  the  sentence  which  contains  the 
error.    Blacken  the  lettered  space  on  your  answer  sheet. 

ITEM 

1.    A.  The  research  papers  will  be  due  on  March  13. 

B.  He  wants  to  know  whether  your  parents  will  attend  the  concert? 

C.  Why  do  you  think  your  teacher  made  such  strict  rules? 
D.  Mr. Carson will be in Europe during July and August.


Example  6  (Glendale  Union  High  School  District) 

PROGRAM    Reading                      SUBDIVISION        II.C.1.b.
COURSE     Modern Reading Techniques    SKILL/CONCEPT      Paragraph Meaning - Inferred Main Idea, Int. & Jr.
TEST                                    BEHAVIORAL LEVEL   Synthesis


RESPONSE  DESCRIPTION 

Given ten paragraphs with the stated main idea omitted, the
learner will demonstrate synthesis of the skill of finding
inferred main ideas by choosing the correct statement of inferred
idea from four possible choices.


CONTENT  LIMITS 

The  reader  is  advised  to  find  inference  in  the  following  manner: 

1.  Look  at  the  details  to  see  if  they  have  something  in  common. 

2.  Make  a  guess  about  the  intended  meaning. 

3.  Review  the  clues  or  ideas  to  see  if  they  support  your  guess. 

An inferred main idea, then, may be defined as the idea or fact
that is not stated but is drawn from stated facts or ideas.
Inference involves a process of inductive reasoning.


ITEM  FORMAT 

The learner will be presented with ten paragraphs, each followed by
four answers which will state, or complete, an inferred main
idea, one of which will be correct, and three of which will be
distractors.  The paragraphs are graded in level of difficulty
from grade 3 through 8.  (Questions 1-5 are intermediate, grades
3-5.  Questions 6-10 are junior, grades 6-8.)


The learner must select the correct inferred main idea with
proficiency.



DIRECTIONS 

Paragraph Meaning - Inferred Main Idea

You will be asked to draw a conclusion about what is really
meant in each of the following passages.  After carefully
reading each paragraph, on the answer sheet blacken the space of
the letter which best states the inferred meaning.

ITEM 


EXAMPLE: 

The house was run-down.  After twelve years it still
was not painted.  There was no porch; crude wooden
steps led up to the warped front door.  The outside
light, hanging down by its cord, swung to-and-fro with
the night breeze.  The house was unfinished on the
inside too.  The ceiling was only plasterboard
haphazardly nailed in place.  Paint and plaster were
cracking and flaking onto the floor from the walls.

(a)  The house had not been painted.

(b)  The inside of the house had plasterboard ceilings.

(c)  The paint flaked onto the floor.

(d)  The house was in disrepair.

The  correct  answer  is  "d." 


1.     I'm thinking, I'm thinking,
So leave me alone.
I don't need your help.
I'll do fine on my own.

I have a few problems
I have to work out,
Which cannot be done
If you stand there and shout.

"We need you for baseball,
So come right away."
I'll come when I feel
I am ready to play.

Please stop making faces.
It won't help to groan.
I'm thinking, I'm thinking,
So leave me alone.


The writer

(a)  dislikes everybody all the time
(b)  never thinks by himself
(c)  sometimes likes to be alone to think
(d)  is not liked by other people



Example 7 (Glendale Union High School District)

COURSE             Algebra 1-2        SUBDIVISION    V-E.
SKILL/CONCEPT      Factoring Polynomials With More Than Three Terms
BEHAVIORAL LEVEL   Application


RESPONSE DESCRIPTION:


Factor  a  polynomial  of  more  than  three  terms 


CONTENT  LIMIT: 


A polynomial of four terms of the form $a^2 + 2ab + b^2 - c^2$, where
c is an integer between 0 and 5, inclusive.


ITEM  FORMAT: 


See I.A.1. for a, b, c

d)  One problem

e)  One wrong answer (b) will be $(a + c)(a - c)(2ab + b^2)$

f)  One wrong answer (c) will be $b(a + c)(a - c)(2a + b)$

g)  One wrong answer (d) will be $(a + b - c)^2$


CRITERIA: 

Select  correct  answer 
DIRECTIONS: 


ITEM 


Factor completely:

$a^2 + 6ab + 9b^2 - 25$

a)  $(a + 3b + 5)(a + 3b - 5)$

b)  $(a + 5)(a - 5)(6ab + 9b^2)$

c)  $3b(a + 5)(a - 5)(2a + 3b)$

d)  $(a + 3b - 5)^2$

e)  None of the above
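
The item format rules e) through g) generate the distractors mechanically from the same quantities as the correct answer, which makes the item easy to check by substitution. The sketch below is an illustration only: `make_item`, `matching_options`, and the parameter `k` (the coefficient of b inside the squared binomial, 3 in the sample item) are names assumed here, and agreement on a grid of integer points is used as a spot check rather than a symbolic proof.

```python
def make_item(k, c):
    """Build the polynomial a^2 + 2k*ab + k^2*b^2 - c^2 and its answer options.

    k is the coefficient of b inside the squared binomial and c is the integer
    whose square is subtracted (k = 3, c = 5 gives the sample item).
    Every expression is a function of (a, b) so options can be compared numerically.
    """
    def polynomial(a, b):
        return a**2 + 2*k*a*b + (k*b)**2 - c**2

    options = {
        "a": lambda a, b: (a + k*b + c) * (a + k*b - c),             # correct factorization
        "b": lambda a, b: (a + c) * (a - c) * (2*k*a*b + (k*b)**2),  # rule e)
        "c": lambda a, b: k*b * (a + c) * (a - c) * (2*a + k*b),     # rule f)
        "d": lambda a, b: (a + k*b - c)**2,                          # rule g)
    }
    return polynomial, options


def matching_options(polynomial, options, points=range(-3, 4)):
    """Letters of the options that equal the polynomial at every test point."""
    return [
        letter
        for letter, expr in options.items()
        if all(expr(a, b) == polynomial(a, b) for a in points for b in points)
    ]


polynomial, options = make_item(k=3, c=5)
print(matching_options(polynomial, options))
```

For the sample item only option a) survives the check. Running the same check with c = 0, a value the content limit allows, also returns option d), because $(a + b - 0)^2$ is then identical to the polynomial; an item writer following this specification may therefore want to avoid c = 0.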




2/28/79

Example 8*


SKILL:    The student will identify the tone or emotion expressed in a
paragraph.

SAMPLE  DIRECTIONS  AND  TEST  ITEM: 

Directions:    Read the paragraph.  Underline the best word to
complete the sentence.

Jimmy had been playing at the beach all day.
It was time to go home.  Jimmy sat down in the
back seat of the car.  He could hardly keep his
eyes open.

Jimmy felt ______.

A.  afraid  B.  friendly  C.  tired  D.  kind


CONTENT  DOMAIN: 


1.  The paragraph will contain situations which are familiar
to the students being tested.

2.  The paragraph will contain no less than three and no more
than six sentences.  The readability level will be no
higher than Second Reader.

3.  The  tones  or  emotions  expressed  will  be  from  the  following 
list: 

sad  mad  angry 

tired  scared  friendly 

happy  lucky  smart 

kind  excited  proud 


RESPONSE  MODE: 

1.  Responses  will  be  one  word  in  length. 

2.  The  items  will  contain  one  correct  and  three  incorrect 
responses . 

3.  Distractors  are  to  be  words  describing  a  feeling  and  may 
be  taken  from  the  list  above. 

4.  Avoid  using  reasonable  answers  as  distractors  (i.e.,  in 
the  sample  item,  "mad"  would  not  .be  a  good  choice  for  a 
distractor— Jimmy  could  feel  mad  about  leaving  the  beach) . 
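A domain specification of this kind lends itself to a simple automated check of draft items. The sketch below is illustrative only: the function names are hypothetical, sentence splitting is a rough punctuation-based approximation, and readability level and the "reasonable answer" rule are not checked because they need human judgment.

```python
import re

# Emotion list from CONTENT DOMAIN rule 3
EMOTIONS = {"sad", "mad", "angry", "tired", "scared", "friendly",
            "happy", "lucky", "smart", "kind", "excited", "proud"}

def split_sentences(paragraph):
    # Rough split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def check_item(paragraph, choices, key):
    """Return a list of rule violations (an empty list means none were found)."""
    problems = []
    n = len(split_sentences(paragraph))
    if not 3 <= n <= 6:
        problems.append(f"paragraph has {n} sentences; 3 to 6 required")
    if len(choices) != 4:
        problems.append("item must offer one correct and three incorrect responses")
    if any(len(c.split()) != 1 for c in choices):
        problems.append("every response must be one word")
    if key not in choices:
        problems.append("keyed answer is not among the choices")
    if key not in EMOTIONS:
        problems.append("keyed emotion is not on the content-domain list")
    return problems
```

Applied to the sample item (four sentences, key "tired"), the check reports no violations.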


*An example of a domain specification from the reading area.  (The
authors are grateful to Marlene Teichert of Educational Progress for the
example.)



Example  9* 


Content:  Reading

Strand:   Comprehension

Level:    2

SKILL

A  student  will  be  able  to  identify  the  main  idea  of  a  paragraph  by  choosing 
the  best  title. 


SAMPLE  TEST  DIRECTIONS  AND  TEST  ITEM 


Test  Directions 


Read  each  paragraph  and  choose  the  best  title.  Circle  the  letter  beside 
your  answer. 

Test  Item 

The second grade went on a class trip.  They saw airplanes and jets.
A man told them how to buy an airline ticket.  They saw the pilot's
cockpit.

a.  Meeting a pilot.

b.  Buying  a  ticket. 

c.  A  trip  to  the  airport. 


CONTENT DOMAIN

1.  Sentences  should  have  no  less  than  3  words,  and  no  more  than  10. 

2.  Each  paragraph  should  have  no  less  than  3  sentences  and  no  more 
than  7. 

3.  Compound  and  simple  sentences  should  be  included. 

4.  Readability  should  be  approximately  2.5. 

5.  The  paragraphs  should  include  both  experience  and  interest-oriented 
subject  matter. 


CHARACTERISTICS  OF  ANSWER  CHOICES  AND  SCORING 

1.  There should be three titles to choose from.  The correct title
is the main idea of the paragraph.

2.  Distractors should contain smaller details from the paragraph.
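The length limits in the content domain (3 to 10 words per sentence, 3 to 7 sentences per paragraph) can be verified for any draft paragraph. This is a minimal sketch with hypothetical function names; words are counted by splitting on whitespace, which is an assumption about how the specification intends words to be counted.

```python
import re

def sentences(paragraph):
    # Rough split on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p.strip() for p in parts if p.strip()]

def check_paragraph(paragraph, min_words=3, max_words=10, min_sents=3, max_sents=7):
    """Report sentence and word counts against the content-domain limits."""
    sents = sentences(paragraph)
    word_counts = [len(s.split()) for s in sents]
    return {
        "sentence_count_ok": min_sents <= len(sents) <= max_sents,
        "word_counts": word_counts,
        "word_counts_ok": all(min_words <= n <= max_words for n in word_counts),
    }
```

For the sample test item, the sentence lengths are 8, 5, 10, and 5 words, so both limits are satisfied.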




*An initial draft of the content included in this domain specification
was prepared by Liz Jerrett and Nancy Cole.

Example 10¹

Content:  Reading

Strand:   Structural Analysis

Level:    3

SKILL 

The  student  will  identify  the  meaning  of  a  word  consisting  of  a  root 
word  and  a  prefix. 


SAMPLE TEST DIRECTIONS AND TEST ITEM

Test  Directions 

Read  the  word  and  the  two  definitions  that  follow  it.    Choose  the 
correct  meaning  of  the  word  and  place  the  letter  (A)  or  (B)  on  the 
line  in  front  of  the  word. 


Test Item

____ 1.  disappear    (A) to appear again    (B) to drop out of sight

____ 2.  exterior     (A) outside            (B) inside

____ 3.  incapable    (A) can do             (B) can't do


CONTENT  DOMAIN 

1.  The  stimulus  words  will  contain  the  following  prefixes: 

un  re  in  dis  ex 

2.  The  words  are  to  be  at  a  vocabulary  level  no  higher  than  level  four. 

3.  The  words  are  not  to  be  included  in  the  context  of  a  sentence. 

4.  See  attached  list  of  words  for  suggested  content. 


CHARACTERISTICS  OF  ANSWER  CHOICES  AND  SCORING 

1.  The student will write the letter of the correct response on the
line provided.

2.  There  will  be  two  choices,  the  correct  response  and  one  distractor. 

3.  The  distractor  will  contain  a  meaning  for  the  root  word  without  the 
prefix  or  the  meaning  of  the  root  word  with  a  different  prefix. 

¹An initial draft of the content included in this domain specification
was prepared by Marlene Teichert.




Suggested  List  of  Words 


un         in           re        dis          ex

uneven     insincere    remind    disown       exclude
unclean    inhuman      reform    disconnect   exclaim
unfold     insight      rename    discover     exhale
untie      incapable    regain    disband      exit
unreal     informal     rejoin    disloyal     expand
unsafe     inability    replant   displease    expel
untrue     inclose      retold    dishonor     expire
unfit      indent       recall    discount     explain
uneasy     inland       reopen    dismount     explore
unhappy    indoor       renew     disarm       extend
unpack     incomplete   reread    disorder     exterior
unload     intake       refill    disable



Example 11¹


Content:  Mathematics

Strand:   Fractions

Level:    4

SKILL 

The student will be able to multiply fractions.

SAMPLE  TEST  DIRECTIONS  AND  TEST  ITEM 
Test  Directions 

Circle  the  answer  to  the  question  below: 

Test  Item 

$\frac{1}{3} \times \frac{3}{4} =$

a.  $\frac{3}{7}$

b.  $\frac{4}{7}$

c.  $\frac{1}{4}$

d.  $\frac{4}{9}$

e.  $\frac{9}{4}$


CONTENT  DOMAIN 


1.  Fractions will be written using the horizontal bar ($\frac{a}{b}$).

2.  Limit to fractions less than one with single digit denominators.

3.  The numerators and denominators of each fraction in an item stem
will have no factor in common other than one.

4.  In the item form $\frac{a}{b} \times \frac{c}{d}$ or $\frac{c}{d} \times \frac{a}{b}$, a and d will have no
common factor except one, and each of the following cases will
be included in the items:

a.  b is a multiple of c, or c is a multiple of b;

b.  b and c are equal;

c.  b and c share a common factor other than one;

d.  b and c share no common factor except one.
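Rules 2 through 4 can be expressed as a small validity check plus a case classifier, which helps confirm that a pool of stems covers every required case. This is a sketch under stated assumptions: the function names are hypothetical, the cases are tested in the order 4b, 4a, 4c, 4d because they overlap, and a value of 1 is not counted as a "multiple" relationship (both choices are interpretations, not part of the specification).

```python
from math import gcd

def stem_is_valid(a, b, c, d):
    """Rules 2-4 for the stem a/b x c/d."""
    proper = 0 < a < b <= 9 and 0 < c < d <= 9               # rule 2
    lowest_terms = gcd(a, b) == 1 and gcd(c, d) == 1         # rule 3
    return proper and lowest_terms and gcd(a, d) == 1        # rule 4, a and d

def case_label(b, c):
    """Classify the relationship between b and c (rule 4, cases a-d)."""
    if b == c:
        return "4b"
    if min(b, c) > 1 and (b % c == 0 or c % b == 0):         # assumption: 1 excluded
        return "4a"
    if gcd(b, c) > 1:
        return "4c"
    return "4d"
```

The sample stem, 1/3 x 3/4, is valid under these rules and falls under case 4b (b and c are equal).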


¹An initial draft of the content included in this domain specification
was prepared by a group of teachers working with the Accountability Renewal
Model Project in Texas.




CHARACTERISTICS OF ANSWER CHOICES AND SCORING

1.  Each item will contain five answer choices, only one of which
is correct.

2.  "None of these" will not be used as an answer choice.

3.  Distractors will represent the most frequent student errors.

4.  Distractors will include errors such as:

a.  multiplication of numerators and addition of denominators;

b.  "cross multiplication" (numerator x denominator) either end up;

c.  addition of numerators and denominators.

5.  Equivalent forms of the correct answer will not be used in a
set of answer choices.
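The error patterns in rule 4 are specific enough to generate distractors automatically from a stem. The following is a minimal sketch, with a hypothetical function name; it assumes the correct answer is reported in lowest terms and that distractors are shown unreduced, exactly as the error would produce them.

```python
from fractions import Fraction

def answer_choices(a, b, c, d):
    """Correct answer and error-based distractors for the stem a/b x c/d."""
    correct = Fraction(a * c, b * d)          # reduced automatically
    candidates = [
        (a * c, b + d),   # rule 4a: multiply numerators, add denominators
        (a * d, b * c),   # rule 4b: cross multiplication, one way up
        (b * c, a * d),   # rule 4b: cross multiplication, other way up
        (a + c, b + d),   # rule 4c: add numerators and denominators
    ]
    distractors = []
    for num, den in candidates:
        # Rule 5: drop anything equivalent to the correct answer; skip repeats
        if Fraction(num, den) != correct and (num, den) not in distractors:
            distractors.append((num, den))
    key = f"{correct.numerator}/{correct.denominator}"
    return key, [f"{n}/{m}" for n, m in distractors]
```

For the sample stem 1/3 x 3/4 this yields the key 1/4 and the distractors 3/7, 4/9, 9/4, and 4/7, which are the same five choices shown in the sample item. For some stems fewer than four distractors survive rule 5, and the item writer would need to supply another frequent error.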




Example 12¹


Content:  Mathematics

Strand:   Life Skills

Level:    7

SKILL 

Student  will  use  reference  units  of  weight/mass,  length,  area,  volume, 
temperature,  time,  and  money  to  estimate  and  determine  measures,  both 
metric  and  customary. 


SAMPLE  TEST  DIRECTIONS  AND  TEST  ITEM 


Test  Directions 

Circle  your  answers  to  the  questions  below: 
Test Items

1.  The distance from Fort Worth to Austin would be measured in

a.  kilometers
b.  kiloliters
c.  kilograms
d.  liters
e.  grams

2.  [Picture of bills and coins not reproduced]

Find the value of money shown above.

a.  $5.26
b.  $5.06
c.  $25.51
d.  $5.31
e.  $1.40

3.  [Picture of a clock face not reproduced]

What is the correct time?

a.  11:15
b.  2:50
c.  2:55
d.  3:55
e.  3:50


¹An initial draft of the content included in this domain specification
was prepared by a group of teachers working with the Accountability Renewal
Model Project in Texas.




CONTENT DOMAIN

1.  Limit the unit measurements to the following:

kilometer, meter, centimeter, millimeter, mile, yard, foot, inch,
the square unit of the previous listed, the cubic unit of the
previous listed, liter, milliliter, kilogram, gram, gallon, quart,
ton, pound, ounce, hour, minute, second, degree Fahrenheit,
degree Celsius.

2.  The following types of items will be included:

a.  Items which reference a real life object familiar to students,
and a characteristic of the object to be measured.  The student
will select the appropriate unit to measure the object.  Items
will include measurement of weight/mass and capacity; area and
volume; and length.

b.  Items which deal with money.  The student will select the total
amount of money or the appropriate collection of bills and coins.
There will be no more than 9 or fewer than 3 bills and/or coins
pictured in the stems or response choices.

c.  Items which deal with time, length, and temperature.  The student
will be presented a picture and will give the appropriate time,
length of object, or temperature.  Time will be measured to the
nearest minute.  Temperature will be measured to the nearest
degree.

3.  Conversions will not be required.

4.  Items will not test knowledge of abbreviations.

CHARACTERISTICS OF ANSWER CHOICES AND SCORING

1.  Each item will contain five answer choices, only one of which is
correct.

2.  "None of these" will not be used as an answer choice.

3.  Distractors will represent the most frequent student errors.

4.  More than one unit that measures the same characteristic will
not be used in the answer choices for an item described in
content 2a.

5.  Answer choices will include the unit of measurement, where
appropriate.




Example 13¹


Content:  Mathematics

Strand:   Life Skills

Level:    7

SKILL 

A  student  will  identify  the  page  (or  pages)  from  a  newspaper  index  on 
which  information  related  to  a  given  topic  can  be  found. 


SAMPLE  TEST  DIRECTIONS  AND  TEST  ITEMS 


Test Directions


Please read the newspaper index below and answer the questions.
Circle the letter beside your answer to each question.

Test Items


The Austin Record News

Amusements       E5-7        Horoscope       F1
Classified Ads   F4-6        Personalities   B5
Comics           C11         Sports          C1-8
Editorials       A14         TV Logs         E9
Financial        D4,5        Weather         F3

1.  Where would you find information about a person born under the
sign of Scorpio?

(a)  F4-6    (b)  B5    (c)  F3    (d)  F1

2.  Where would you find the necessary qualifications for available jobs?

(a)  E9    (b)  F4-6    (c)  B5    (d)  F3

3.  If you wanted to find the standings in the National Football League,
where would you look?

(a)  C1-8    (b)  E5-7    (c)  C11    (d)  E9

4.  Where would you read a person's opinion on a current local political
matter?

(a)  A14    (b)  C11    (c)  D4,5    (d)  C1-8

¹An initial draft of the content included in this domain specification
was prepared by a group of teachers working with the Accountability Renewal
Model Project in Texas.




CONTENT DOMAIN

1.  A newspaper index will be reproduced containing both a section
letter and page(s) for each topic.

2.  The newspaper index will contain no more than ten topics listed
in one or two columns.

3.  The ten (maximum) topics selected will be from the following list:

Amusements        Editorials       Sports
Classified Ads    Financial        TV Log
Comics            Food             Weather
Crosswords        Horoscope
Deaths            Personalities

4.  Test items will relate to the topics listed but will not name
the topics.

CHARACTERISTICS OF ANSWER CHOICES AND SCORING

1.  There will be one correct and three incorrect answers for each
test item.

2.  Avoid having distractors that could be possible answers (e.g.,
for a test item such as "Where can you find information on
Ferguson Jenkins?" do not include both "Sports" and "Personalities"
as possible answer choices).

3.  Incorrect choices will be other page numbers listed in the newspaper
index.

4.  Some of the incorrect choices should include the same section
letter as the correct answer.
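The answer-choice rules for this domain can be sketched as a small routine that picks three distractors for a keyed topic from the sample index. The function name and the `exclude` parameter are hypothetical; the selection is deliberately deterministic (one distractor from the same section where available, then the first others in index order) so the behavior is easy to inspect, whereas an item writer would normally vary the choices.

```python
# Sample index from the test item (topic -> section letter and page reference)
INDEX = {
    "Amusements": "E5-7", "Classified Ads": "F4-6", "Comics": "C11",
    "Editorials": "A14", "Financial": "D4,5", "Horoscope": "F1",
    "Personalities": "B5", "Sports": "C1-8", "TV Logs": "E9", "Weather": "F3",
}

def pick_distractors(topic, exclude=()):
    """Three incorrect page references for `topic`.

    `exclude` lists topics that could also be defended as correct (rule 2).
    """
    key = INDEX[topic]
    pool = [p for t, p in INDEX.items() if t != topic and t not in exclude]
    same_section = [p for p in pool if p[0] == key[0]]      # rule 4
    other_section = [p for p in pool if p[0] != key[0]]     # rule 3
    return (same_section[:1] + other_section)[:3]
```

For example, `pick_distractors("Sports", exclude=("Personalities",))` keeps B5 out of the choices for a question about an athlete and includes C11, which shares the section letter of the keyed answer C1-8.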




Example 14¹

Applications of Approaches to Norm-Referenced Reliability


General  Description 

Given a description of a situation requiring the interpretation
and use of a set of norm-referenced test scores, the student will select
from a list of reliability coefficients the coefficient that should be
computed, based upon the description of the given situation.


Sample Item

Directions:  Read each testing situation described below.  Decide
which reliability coefficient would be best suited for
the situation described.  Write the letter preceding
the reliability coefficient you have chosen on the
separate answer sheet.

Ms. Jones, a ninth grade history teacher, has constructed
a final examination in History 9 for the
fall semester.  She does not have access to either
machine scoring facilities or computer analysis.
Which reliability coefficient would be best suited
for her to compute given the situation?

a.  Coefficient of Stability and Equivalence

b.  Kuder-Richardson 20

c.  Kuder-Richardson 21

d.  Coefficient of Equivalence

e.  Coefficient of Stability


¹We are grateful to participants at an AERA training program held
in Toronto, March 1978, for considerable help in formulating this domain
specification.


Stimulus Attributes

1.  Each of the test items will consist of 3 parts: (a) a passage
describing the norm-referenced testing situation, (b) a question
requiring the student to choose which reliability coefficient is
best suited, and (c) a set of five possible answers.

2.  The passage describing the norm-referenced testing situation will
consist of 100 words or less.

a.  The situation described will contain references to paper/pencil
tests and performance tests only; no physical diagnostic tests
(e.g., hearing) will be described.

b.  The passage describing the testing situation will include:

i.  the purpose of testing
ii.  the area of testing
iii.  whether the test is a group or individual test
iv.  whether the test is standardized, teacher-made, quasi-standardized
(i.e., used by all teachers in a school)

c.  The testing situations described may measure either the cognitive,
affective, or psychomotor domains.

d.  The situations described are limited to the following categories
of use of norm-referenced test scores:
i.  grading
ii.  selection
iii.  placement
iv.  program evaluation
v.  ability grouping
vi.  individual diagnosis

(See the specification supplement for an expanded discussion of the
categories.)

e.  Situations describing the uses of tests in research or in the
generation of research hypotheses are not applicable.

f.  The situations described also involve an explication of the
following variables (when appropriate):
i.  speeded or power test
ii.  similarity in difficulty of items (homogeneity of items)
iii.  when test is given (during program or at end)
iv.  the exact nature of the test (aptitude, achievement, or
psychomotor)
v.  whether supplemental aids, such as a computer, etc., are
available

3.  Following the passage describing the norm-referenced testing situation,
each item will contain the following question: "Which
reliability coefficient would be best suited for him/her/them to
compute, given the situation?"

4.  The set of 5 possible answers should be written subject to
the restraints set up in the Response Attributes section.

5.  All passages should be written at no higher than the 10th grade
reading level.

Response Attributes

1.  Students will be asked to circle the letter beside one of five possible
answers (where one answer is correct and the other four answers are
incorrect).

2.  The correct answer and the four distractors should be chosen from
the following list of reliability coefficients:

a.  coefficient of stability

b.  coefficient of equivalence

c.  coefficient of stability and equivalence

d.  split half (odd-even) reliability coefficient

e.  split half (first half-last half) reliability coefficient

f.  Kuder-Richardson 20 coefficient

g.  Kuder-Richardson 21 coefficient

h.  inter-rater reliability coefficient

3.  For each item, the four distractors chosen will be the four of
the seven possibilities that are most nearly suited for the given
situation.

4.  Answers requiring combinations of reliability coefficients,
and "all of the above" and "none of the above" will not be used.

5.  The correct answer will be consistent with the discussion and
appropriate procedure presented in currently-used educational and
psychological measurement texts (e.g., Thorndike and Hagen, Brown,
Stanley and Hopkins, Payne, Sax).
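Two of the listed coefficients differ mainly in the information they require, which is why the availability of computing aids appears among the stimulus variables. As an illustration only, the sketch below computes the Kuder-Richardson 20 and 21 coefficients from their standard textbook formulas. The function names are hypothetical, and the use of the population variance of total scores is an assumption (texts differ on this convention).

```python
from statistics import mean, pvariance

def kr20(responses):
    """KR-20 from a matrix of 0/1 item scores (one row per examinee)."""
    k = len(responses[0])                       # number of items
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)               # assumption: population variance
    p = [mean(row[i] for row in responses) for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)     # needs every item's difficulty
    return k / (k - 1) * (1 - sum_pq / var_total)

def kr21(k, mean_total, var_total):
    """KR-21 needs only the number of items and the mean and variance of totals."""
    return k / (k - 1) * (1 - mean_total * (k - mean_total) / (k * var_total))
```

KR-21 can be worked by hand from three summary numbers, whereas KR-20 requires the proportion correct on every item. When all items are equally difficult the two coefficients agree.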


Specification Supplement

The following list of situations to be used for developing the test
item passages is an expansion of those listed in the Stimulus Attributes.
This expansion is necessary to further delimit the content areas.


General Description       Particular Situation

1.  Grading               a.  classroom grading on a unit of work, no
                              computer-assisted facilities available
                          b.  classroom grading on a unit of work,
                              computer facilities available
                          c.  end-of-course grading on final exam, no
                              facilities available
                          d.  end-of-course grading on final exam,
                              facilities available

2.  Selection             a.  selection of a group for a special course,
                              using an achievement test score
                          b.  selection of a group for a special activity,
                              using an aptitude test score

3.  Placement             a.  placement of an individual in an accelerated
                              program, using an achievement test score
                          b.  placement of an individual in a special class,
                              based on observation of psychomotor abilities
                          c.  placement of an individual in a special group,
                              based on a projective test

4.  Program Evaluation    a.  evaluation of a program using final test
                              scores

5.  Ability Grouping      a.  placing students into ability groups based upon
                              i.  achievement test scores
                              ii.  aptitude test scores
                              iii.  performance test scores

6.  Individual Diagnosis  a.  diagnosis utilizing:
                              i.  achievement test score
                              ii.  teacher-constructed test score


2.5    Item Forms Analysis

Hively et al. (1973), using the work of Osburn (1968) and Hively,
Patterson, and Page (1968) as a basis, have developed a comprehensive method
for developing a domain definition called item forms analysis.  Based on
Osburn's notion of a "universe-defined" test and consonant with Traub's
explication of strong domain sampling validity, Hively et al. felt (initially)
that their domain definition should satisfy the following two requirements:

1.  All the items which could be written from the content domain
to be tested must be written (or known) in advance of the
final item selection process.

2.  A random or stratified sampling procedure must be used
in the item selection process.

Before discussing item forms analysis in some detail and
also providing some examples of item forms, two comments
should be made.  One, the experience of Hively et al. in developing item
forms pointed out one very glaring weakness of item generation procedures:
they work well only with very structured subject domains, such as mathematics
(the subject Hively and his associates considered).  Two, and perhaps more
important, Hively et al. found that while their attempt to specify "all
the behaviors which comprise specific pieces of knowledge" was a great
"quantum leap" over the use of behavioral objectives, it was apparent that
it was impossible "to exhaustively define universes of criterion behavior."

This forced Hively and his associates into reconsidering the first requirement
of their domain, which we listed above.  They began to define the sets
of test items not as universes of items but as the "nuclei of hypothetical
repertoires of behavior," called "domains" (Hively et al., 1973).


Hively and his associates found it an
impossible task to list all the items in the content area under consideration,
and thus were forced to reconceptualize their approach in terms of
domains of behavior, to which a group of items may belong.  According to
Hively et al.:

    The basic notion underlying domain-referenced achievement
    testing is that certain important classes of
    behavior in the repertoires of experts (or amateurs)
    can be exhaustively defined in terms of structured
    sets or domains of test items.  Testing systems may be
    referenced to these domains in the sense that a testing
    system consists of rules for sampling items from
    a domain and administering them to an individual
    (or sample of individuals from a specified population)
    in order to obtain estimates of the probability
    that an individual (or group of individuals) could
    answer any given item from the domain at a specified
    moment in time.

    Domains of test items are structured and built up
    through the specification of stimulus and response
    properties which are thought to be important in
    shaping the behavior of individuals who are in the
    process of learning to be experts.  These properties
    may be thought of as stratifying large
    domains into smaller domains or subsets.

Hively et al. use item generation forms to specify these domains
of behavior and thus circumvent the problem of trying to exhaustively define
the universe on the individual item level.  We should note that this switch
in conceptualization from "universe of items" to "domain" does not affect
the inferential procedures that can be used.  If one can develop these
domains through the use of item generation forms, the strong domain sampling
validity situation (Traub, 1975) will have been attained.  The test developer
can feel confident in making an inference about what the examinee knows
about the domain, based upon his/her test score.
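The sampling-and-inference idea in this passage can be made concrete with a short sketch: items are drawn at random within each stratum of a domain, and the examinee's proportion correct in each stratum is weighted by the stratum's share of the domain. The function names, the dictionary representation of a domain, and the weighting scheme are illustrative assumptions, not procedures given by Hively et al.

```python
import random

def stratified_sample(domain, n_per_stratum, rng):
    """Draw n items at random, without replacement, from every stratum.

    `domain` maps a stratum name to the list of items it contains.
    """
    return [
        (stratum, item)
        for stratum, items in domain.items()
        for item in rng.sample(items, n_per_stratum)
    ]

def estimate_domain_score(domain, sample, answers_correctly):
    """Estimated probability of answering a randomly chosen domain item.

    Each stratum's proportion correct is weighted by the stratum's size.
    """
    total_items = sum(len(items) for items in domain.values())
    estimate = 0.0
    for stratum, items in domain.items():
        drawn = [item for s, item in sample if s == stratum]
        p = sum(answers_correctly(item) for item in drawn) / len(drawn)
        estimate += len(items) / total_items * p
    return estimate
```

For instance, an examinee who answers every item in a 20-item stratum and none in a 60-item stratum receives an estimated domain score of 0.25, whichever items happen to be drawn.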

We have discussed the conceptual switch from "universe of items" to
"domain" by Hively and his associates for two reasons.  One, we are trying
to give the reader another sense of the "idealized" nature of being able to
specify a content universe and to present a case for how difficult it
is to specify a domain.  This should reemphasize the practical utility of
Popham's domain specification procedure.  Two, we have tried to present a
context in which to understand the examples that follow.

With these basics specified, we will do the following in terms of
item forms analysis.  First, we will formally define an item form, as
specified by Hively, and discuss three strategies Hively et al. have suggested
for developing the domain which the item forms represent.  Then, in section 2.6, we
will provide an example of an item form and briefly discuss the elements that
constitute an item form.  This will be done on a cursory level; the reader
should refer to the work of Hively et al. (1973) for a detailed
discussion.  We are here trying to give a "flavor" of the approach, and are
not going to do justice to the subtleties.  Finally, in section 2.6, we will
provide four other relevant examples of item forms, all taken from Hively et al.'s
work on the Minnemast Project.

How does Hively formally define an item form?  According to Hively et al.
(1973):

    Items are written as scripts directing
    the actions of an examiner, with space provided
    in which to record the responses of a student.
    Certain elements in the scripts are variable. . . 'Item
    forms' determine the domains of permissible replacements
    for these variables.  By sampling items from these
    domains, one can estimate the proportion of students who
    'have' the system of concepts and skills represented by
    the item form as a whole, as well as the proportions
    who respond correctly to various subcomponents.

How does one first develop the general domains in which the item forms
serve as item generators?  Hively et al. (1973) list three possible
strategies for developing the domains:

1.  Start with a list of prototype items taken from the instructional
material and then alter these items to produce sets of equivalent
items measuring the objectives supposedly measured by the prototypical
items.  Then have content experts review the items so as to end up
with a pool of items which purport to measure the instructional
objectives.

2.  State the instructional objectives and have the item writers develop
items which supposedly measure the instructional objectives.

3.  Develop hypotheses about sequences and hierarchies of instruction
through a careful examination of the basic goals of the instructional
unit.  Then construct items in accordance with these sequences and
hierarchies.

Regardless of the way in which domains are initially defined and
developed, at some point it is necessary that item forms be constructed.
Hively et al. (1973) give two relevant reasons for the use of item forms:

1.  To obviate the necessity to store individual items by substituting
a set of written rules through which items can be generated when
needed, and

2.  To enable the relationships among items to be traced by giving
clear specifications of relevant item characteristics.

In other words, the collections of items generated by any of the three procedures
just discussed are organized into "formalized schemes," these schemes
being the item forms.  Each item form is made up of two major parts: one part
tells how one would generate the items, and the other describes the items'
characteristics.

As a means of summing up this discussion of item forms analysis, we
can make the following comment.  If it is possible to explicitly define the
domain, we feel that item generation forms are the mode to use.  However, as
the reader can see, both from the discussion just presented and the ensuing
examples, the complexities of specification can be enormous.  Also, the
procedures work only for highly structured subject domains.  It is for
the above two reasons that we prefer the use of Popham's procedure for
the development of domain specifications.  However, we need to again point
out that the domain specification procedure developed by Popham only implicitly
defines the domain in question, and the items generated need to be validated
by an independent method.  That independent method, the use of content
specialists, will be discussed in Unit 3.



2.6    Examples of Item Forms Analysis

In this section, we first provide an example of an item form, and
then briefly discuss the constituent parts.  Then, four more examples of
item forms are provided.  The first one comes from a paper by Hively et al.
(1968) and the other three are from a book by Hively and his associates
(1973)¹.


¹Permission for duplication of these materials in our final
report is pending.



Example  One 


ITEM FORM 2.2

Producing examples of simple and non-simple,
open and closed curves.¹


GENERAL DESCRIPTION

The child is given an example of a simple open, simple
closed, non-simple open, or non-simple closed curve and asked
to draw several more that are different, but of the same kind.


STIMULUS AND RESPONSE CHARACTERISTICS

Constant for All Cells
Child is given an example of the required type of curve at
the beginning.  Child produces curves by drawing them.

Distinguishing Among Cells
Type of curve required: (1) simple open, (2) simple closed,
(3) non-simple open, (4) non-simple closed.  (The last two
curve types are not standard topological classifications, but
are clearly defined.)

Varying Within Cells
Instances of sample curves presented.


CELL MATRIX

Script (b)

Simple closed        (1)
Simple open          (2)
Non-simple closed    (3)
Non-simple open      (4)

(Sample curve is drawn from replacement set
corresponding to script.)


ITEM FORM SHELL

ITEM FORM: 3.2        CELL: [ ]        APPLICATION: [ ]

MATERIALS

Curve card (a) [ ]
Response Sheet
Pencil

DIRECTIONS TO E

Don't look at curve card yourself, until you have laid it in
front of S.

After S finishes each answer, write its number beside it.

If you aren't sure whether S is finished, ask him.

In transition to each new question, you can say "um hum"
or "O.K." but don't say "good" or otherwise put
special emphasis on correct answers.

SCRIPT

Here is a [(b) simple closed] curve.
Here is a pencil and paper.  Draw
another [(b) simple closed] curve
that is different from that one.
(Answer #1)

Now, draw another [(b) simple closed]
that is different.
(Answer #2)

Now, draw another [(b) simple closed].
(Answer #3)

Now draw another [(b) simple closed]
that is different.
(Answer #4)

RECORDING

Attach response sheet.


REPLACEMENT SCHEME

Curve Cards (a)
Cell 1: choose from R.S. 2.1.
Cell 2: choose from R.S. 2.2.
Cell 3: choose from R.S. 2.3.
Cell 4: choose from R.S. 2.4.

Script (b)
Cell 1: simple closed
Cell 2: simple open
Cell 3: non-simple closed
Cell 4: non-simple open


REPLACEMENT SETS

R.S. 2.1.  Simple, closed curves
R.S. 2.2.  Simple, open curves
R.S. 2.3.  Non-simple, closed curves
R.S. 2.4.  Non-simple, open curves

[Drawings of the sample curves in each replacement set are not reproduced.]


SCORING SPECIFICATIONS

Cell 1 (simple closed): Curve bounds an area, may not have
crossing points.
Cell 2 (simple open): Curve does not bound an area, may
not have crossing points.
Cell 3 (non-simple closed): Part of curve bounds an area,
must have at least one crossing point.
Cell 4 (non-simple open): No part of curve bounds an area,
must have at least one crossing point.


¹Originally developed by Stephen Lundin.



What follows is a short description of each of the constituent parts; these descriptions were edited from Hively et al.'s materials.

Item-Form Shell: This element contains the common, unvarying component of all items that could be generated by the item form. The blank spaces in the shell are filled in according to the specifications in the Replacement Scheme. Instructions to the examiner are placed here, and these instructions specify materials, directions/script, and recording.

Replacement Scheme: This element specifies how to choose values or prescriptions for each of the variable parts of the item form. Replacements specified in this section come from the Replacement Sets.

Stimulus and Response Characteristics: These descriptions are intended to describe and justify whatever behavioral analysis may underlie the properties or characteristics utilized in structuring the domain of items.

Cell Matrix: This element does two things: (1) it provides a summary of the information found under Stimulus and Response Characteristics, and (2) it assigns an identification number to each cell to coincide with the cell numbers used in the Replacement Scheme.

Scoring Specifications: This section describes the properties to be used to distinguish between correct and incorrect responses.

Now consider the four additional examples of item forms on the next pages.


Example  Two 




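Descriptive Title: Basic fact; minuend > 10.
Sample Item:       13 - 6
General Form:      A - B
Generation Rules:
  1. A = 1a; B = b
  2. (a < b) ∈ U
  3. [illegible]

Descriptive Title: Simple borrow; one-digit subtrahend.
Sample Item:       53 - 7
General Form:      A - B
Generation Rules:
  1. A = a_1 a_2; B = b
  2. a_1 ∈ [set illegible]
  3. (b > a_2) ∈ U0

Descriptive Title: Borrow across 0.
Sample Item:       403 - 138
General Form:      A - B
Generation Rules:
  1. N ∈ [set illegible]
  2. A = a_1 a_2 . . . ; B = b_1 b_2 . . .
  3. [three ordered comparisons between digits of A and B, subscripts illegible] ∈ U0
  4. b_2 ∈ U0
  5. a_2 = 0

Descriptive Title: Equation; missing subtrahend.
Sample Item:       42 - __ = 25
General Form:      A - __ = B
Generation Rules:
  1. A = a_1 a_2; B = b_1 b_2
  2. a_1 ∈ U
  3. a_2, b_1, b_2 ∈ U0
  4. Check: 0 < B < A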


Explanation of notation:

Capital letters A, B, . . . represent numerals.
Small letters (with or without subscripts) a, b, a_1, a_2, etc., represent digits.

x ∈ { . . . }: Choose at random a replacement for x from the given set.
a, b, c ∈ { . . . }: All of a, b, c are chosen from the given set with replacement.
N_A: Number of digits in numeral A.
N: Number of digits in each numeral in the problem.
a_1, a_2, . . . ∈ { . . . }: Generate all the a_i necessary. In general, ". . ." means continue the pattern established.

(a < b) ∈ { . . . }: Choose two numbers at random without replacement; let a be the smaller.

{H, V}: Choose a horizontal or vertical format.

P{A, B, . . . }: Choose a permutation of the elements in the set. (If the set consists of subscripts, permute those subscripted elements.)

Set operations are used as normally defined. Note that A - B means the elements of A that are not in B. Ordered pairs are also used as usual.

Check: If a check is not fulfilled, regenerate all elements involved in the check statement (and any elements dependent upon them).

Special sets:

U = {1, 2, . . . , 9}
U0 = {0, 1, . . . , 9}
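The generation rules in Example Two read much like a small program. The sketch below is one possible rendering of two of the item forms ("Basic fact; minuend > 10" and "Equation; missing subtrahend"), using only the rules that are legible in this copy. The function names, the use of a seeded random generator, and the omission of the horizontal/vertical format choice are assumptions made for illustration; they are not part of the original item forms.

```python
import random

U = range(1, 10)    # special set U  = {1, 2, ..., 9}
U0 = range(0, 10)   # special set U0 = {0, 1, ..., 9}


def basic_fact_item(rng):
    """Item form 'Basic fact; minuend > 10': returns (A, B) for 'A - B'."""
    # Rule 2: (a < b) in U: two digits drawn without replacement, a the smaller.
    a, b = sorted(rng.sample(U, 2))
    # Rule 1: A = 1a (two-digit numeral with tens digit 1); B = b.
    return 10 + a, b


def missing_subtrahend_item(rng):
    """Item form 'Equation; missing subtrahend': returns (A, B) for 'A - __ = B'."""
    while True:
        a1 = rng.choice(U)                               # Rule 2: a1 in U
        a2, b1, b2 = (rng.choice(U0) for _ in range(3))  # Rule 3: with replacement
        A, B = 10 * a1 + a2, 10 * b1 + b2                # Rule 1
        if 0 < B < A:                                    # Rule 4: the Check
            return A, B
        # Check not fulfilled: regenerate all elements involved.


rng = random.Random(0)  # fixed seed so a generated test form can be reproduced
A, B = basic_fact_item(rng)
print(f"{A} - {B} = ?")
A, B = missing_subtrahend_item(rng)
print(f"{A} - __ = {B}")
```

Because a < b in the first form, every generated item requires regrouping from the tens place, which is the behavior the item form is meant to isolate.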


Example  Three 


ITEM FORM 9.7*

Producing a number satisfying a given order relation to specified number(s) (spoken form).

GENERAL DESCRIPTION

The child is asked to say the name of a number that bears a specified order relation ("greater than" or "less than") to a given number or numbers in the range 0 through 20. Given numbers are presented in spoken form and the response is spoken.

STIMULUS AND RESPONSE CHARACTERISTICS

Constant for All Cells

The presentation is completely spoken; a spoken response is required.

Distinguishing Among Cells

Three scripts are used, asking respectively for a number greater than a given number, for a number less than a given number, and for a number greater than one given number and less than another.

Within the third script three conditions are allowed: (1) first given numeral greater than second with required number possibly an integer; (2) first given numeral greater than second with required number necessarily not an integer; and (3) first given numeral less than second so that the solution to the problem is the empty set.

Varying Within Cells

Within each cell the given numbers are integers from the range 0 through 20 chosen so that the correct response (when it is not the empty set) can be a real number from the range 0 through 20.

CELL MATRIX

Cell   Script (a)                            Numerals (b)
(1)    "greater than b1"                     0 ≤ b1 ≤ 19
(2)    "less than b1"                        1 ≤ b1 ≤ 20
(3)    "greater than b1 but less than b2"    0 ≤ b1 ≤ 18; b1 + 2 ≤ b2 ≤ 20
(4)    "greater than b1 but less than b2"    0 ≤ b1 ≤ 19; b2 = b1 + 1
(5)    "greater than b1 but less than b2"    1 ≤ b1 ≤ 20

ITEM FORM SHELL

MATERIALS
None

DIRECTIONS TO E
Read script to child.
Write down child's exact words.

SCRIPT
Tell me a number that is (a).

REPLACEMENT SCHEME

(a) Script
Cell 1: "greater than b1."
Cell 2: "less than b1."
Cells 3, 4, 5: "greater than b1 but less than b2."

(b) Numerals within Script
Cell 1: Choose b1 from R.S. 9.1
Cell 2: Choose b1 from R.S. 9.2
Cell 3: Choose two numbers from R.S. 9.3
        Let b1 = smaller number; b2 = larger number
        Reject if b2 - b1 ≤ 1
Cell 4: Choose b1 from R.S. 9.3
        Let b2 = b1 + 1
Cell 5: Choose two numbers from R.S. 9.3
        Let b1 = larger number; b2 = smaller number
        Reject if b1 = b2

REPLACEMENT SETS

R.S. 9.1: Whole numbers 0, 1, 2, . . . , 19.
R.S. 9.2: Whole numbers 1, 2, 3, . . . , 20.
R.S. 9.3: Whole numbers 0, 1, 2, . . . , 20.

SCORING SPECIFICATIONS

Cell 1: Any real number x where x > b1
Cell 2: Any real number x where x < b1
Cell 3: Any real number x where b1 < x < b2
Cell 4: Any real number x where b1 < x < b2
Cell 5: Any response equivalent to saying that there are no numbers which can fulfill the conditions.

* Originally developed by Donald Sension.
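Item Form 9.7 separates item generation (the Replacement Scheme) from item scoring (the Scoring Specifications), and both can be written down directly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a child's answer is represented as a number or as None (meaning "there is no such number"), and Cell 4 draws b1 from 0 through 19 as shown in the Cell Matrix so that b2 stays within 0 through 20.

```python
import random

RS_9_1 = range(0, 20)   # R.S. 9.1: whole numbers 0, 1, ..., 19
RS_9_2 = range(1, 21)   # R.S. 9.2: whole numbers 1, 2, ..., 20
RS_9_3 = range(0, 21)   # R.S. 9.3: whole numbers 0, 1, ..., 20


def generate_item(cell, rng):
    """Return (script, b1, b2) for one cell; b2 is None in Cells 1 and 2."""
    if cell == 1:
        b1, b2 = rng.choice(RS_9_1), None
        phrase = f"greater than {b1}"
    elif cell == 2:
        b1, b2 = rng.choice(RS_9_2), None
        phrase = f"less than {b1}"
    else:
        if cell == 3:
            while True:
                b1, b2 = sorted(rng.sample(RS_9_3, 2))
                if b2 - b1 > 1:          # reject if b2 - b1 <= 1
                    break
        elif cell == 4:
            b1 = rng.choice(RS_9_1)      # assumption: bound taken from the Cell Matrix
            b2 = b1 + 1
        elif cell == 5:
            # b1 = larger number, b2 = smaller number; sampling without
            # replacement already enforces "reject if b1 = b2".
            b2, b1 = sorted(rng.sample(RS_9_3, 2))
        else:
            raise ValueError("cell must be 1 through 5")
        phrase = f"greater than {b1} but less than {b2}"
    return f"Tell me a number that is {phrase}.", b1, b2


def is_correct(cell, b1, b2, response):
    """Apply the Scoring Specifications; response is a number or None."""
    if cell == 5:
        return response is None          # "no number fulfills the conditions"
    if response is None:
        return False
    if cell == 1:
        return response > b1
    if cell == 2:
        return response < b1
    return b1 < response < b2            # Cells 3 and 4


script, b1, b2 = generate_item(4, random.Random(7))
print(script)
```

In Cell 4 any correct answer is necessarily a non-integer (for example 7.5 when b1 = 7 and b2 = 8), which is why the scoring rule accepts any real number rather than whole numbers only.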


Example  Four 


ITEM FORM 16.14*

Comparing two objects on equal-arm balance and choosing a symbol to complete a statement of the weight relation.

GENERAL DESCRIPTION

The child is asked to compare the weights of two objects that may be (1) indistinguishable by hefting but easily distinguished on the balance, (2) indistinguishable even on the balance. In each of these situations, size varies as an irrelevant dimension. An equal-arm balance is available but instructions for its use are non-directive. The child is asked to select one of the three symbols (>, <, and =) and place it in the blank space provided between the two weight symbols.

STIMULUS AND RESPONSE CHARACTERISTICS

Constant for All Cells

The equal-arm balance is of similar construction to that used in MINNEMAST Unit 16, made of Tinkertoys, cardboard, string, a metal weight, and a foot ruler.

The objects are opaque, cylindrical bottles, identical except for weight (either 23 gm. or 25 gm.) and size (either 2" x [illegible] or 2 1/2" x [illegible]). Each is identified by a lower-case letter assigned at random.

The child is asked to complete a symbolic statement, corresponding to the weight relation, by choosing the correct relation symbol.

Distinguishing Among Cells

Three weight relations (detectable by balance only, not by hefting or "feel") defined in terms of the location of the objects when placed in front of the child:
left > right; left < right; left = right.
Three size relations:
left > right; left < right; left = right.

CELL MATRIX

                    Weight Relations (Detectable by Balance Only)
Size Relations      W_l > W_r      W_l < W_r      W_l = W_r
S_l > S_r           (1)            (4)            (7)
S_l < S_r           (2)            (5)            (8)
S_l = S_r           (3)            (6)            (9)

ITEM FORM SHELL

MATERIALS

Beam Balance
Objects l and r from T.O. 16.14.0
Stimulus-Response sheet (attached)
Pencil

DIRECTIONS TO E

Place materials in front of child. (Keep order of objects given above.)

[Diagram of the layout, from top to bottom: Balance; objects; S-R sheet; Subject.]

SCRIPT

Here are two objects. They have symbols attached to them. Compare them by weight and write one of these three signs (point) in the blank (point) to form the comparison sentence.

You may use this balance if you need to.

RECORDING

Attach Stimulus-Response sheet to this page.
Describe what child did.

If balance was used, insert object symbols in schematic drawing of the balance given below, and mark the position of the plumb-line at the time of child's judgment.

* Originally developed by Wells Hively.


DESCRIPTION OF MATERIALS

Pencil (T.O. 16.1.1)

Beam Balance (T.O. 16.13.1): Equal-arm beam balance made from Tinkertoy materials as described in MINNEMAST Unit 16.

Set of Weight Comparison Objects (T.O. 16.14.0): Set of opaque plastic cylindrical bottles with firmly fitting lids. Two sizes of bottles have been chosen. The small bottle has a length of 2" and a diameter of [illegible]. The large bottle has a length of [illegible] and a diameter of [illegible]. Two weight values have been chosen so that the objects cannot typically be distinguished by hefting but can be distinguished on the balance. Each object is designated by a randomly chosen, lowercase letter.

                Weight
Size            23 gm          25 gm
small           a              m, k
large           b              o, n

Stimulus-Response sheet (attached to item) (T.O. 16.14.1): a sheet of paper approximately 6" x 4" with the following display:

    Write >, <, or = in the blank

    W_l ____ W_r

where l and r are the appropriate subscripts (from Replacement Scheme).

REPLACEMENT SCHEME

(l, r) Objects
Cell 1: (o,a)
Cell 2: (m,b)
Cell 3: Choose from R.S. 16.13
Cell 4: (b,m)
Cell 5: (a,o)
Cell 6: Choose from R.S. 16.14
Cell 7: Choose from R.S. 16.15
Cell 8: Choose from R.S. 16.16
Cell 9: Choose from R.S. 16.17

REPLACEMENT SETS

R.S. 16.13: Ordered pairs (m,a); (o,b)
R.S. 16.14: Ordered pairs (a,m); [remaining entries illegible]
R.S. 16.15: Ordered pairs (b,a); (o,m)
R.S. 16.16: Ordered pairs (a,b); (m,o)
R.S. 16.17: Ordered pairs (m,k); (o,n)

SCORING SPECIFICATIONS

A correct response is made by writing the correct symbol (>, <, or =) in the blank space to complete the comparison sentence. This should be > in Cells 1, 2, and 3; < in Cells 4, 5, and 6; = in Cells 7, 8, and 9.

Example  Five 


ITEM FORM 26.2*

Plotting a single point on a volume-weight graph.

GENERAL DESCRIPTION

A graph, with axes indicating volume and weight, and a sheet displaying either an ordered pair or a volume-weight chart is presented. The child is asked to plot the point represented by the data onto the grid.

STIMULUS AND RESPONSE CHARACTERISTICS

Constant for All Cells

The grid has the characteristics described in the Description of Materials.

Distinguishing Among Cells

The child is given the data either as an ordered pair or as a volume-weight chart.

The data are such that the point to be plotted is either at the intersection of two grid lines, or on an X-axis grid line at a position intermediate (in tenths) between two Y-axis grid lines.

Complete crossing of these categories yields four cells.

Varying Within Cells

The data for the point to be plotted are varied within the limits of the grid and of the Cell Constants specifications.

For Cells 1 and 2, the Volume and Weight values are both chosen from the set of integers 1 through 12, with the requirement that the two values must not be identical. (This condition eliminates situations where order would not matter.)

For Cells 3 and 4, the Volume value is chosen from the set of integers 1 through 12; and the Weight value, j.k units, is chosen so that j is from the set of integers 0 through 11, and k is from the set of integers 1 through 9.

CELL MATRIX

                         Y-coordinate     Y-coordinate
                         an integer       in tenths
Data as Ordered Pair     (1)              (3)
Data as V/W Chart        (2)              (4)

ITEM FORM SHELL

MATERIALS

Stimulus Sheet (attached)
Grid (attached)
Pencil

DIRECTIONS TO E

Place materials in front of child and point to the relevant parts as you say the script.

When child has finished, attach both the stimulus sheet and the grid to this page.

SCRIPT

(d)

DESCRIPTION OF MATERIALS

Stimulus Sheet (attach one of the following objects to the item as specified by (a) in the Replacements).

(T.O. 26.5.1): A sheet of 6" x 4" notepaper displaying the ordered pair P [(b), (c)].

(T.O. 26.4.1): A sheet of 6" x 4" notepaper displaying the following labeled chart:

OBJECT     VOLUME                   WEIGHT
           (in units of volume)     (in units of weight)
P          (b)                      (c)

Grid (attached to item) (T.O. 26.2.1): A sheet of paper displaying a grid, 6" x 6", with gridlines 1/2" apart. On each axis, the grid lines are marked with the numbers 1 through 12. The X-axis is labeled "Volume (in units of volume)," and the Y-axis is labeled "Weight (in units of weight)."

Pencil (T.O. 26.1.1)

REPLACEMENT SCHEME

(a) Stimulus Sheet
Cells 1 and 3: T.O. 26.5.1
Cells 2 and 4: T.O. 26.4.1

(b, c) Coordinates of point P for Stimulus Sheet
Cells 1 and 2: Choose b from R.S. 26.1
               Choose c from R.S. 26.1
               Reject if b = c
Cells 3 and 4: Let b = i, c = j.k
               Choose i from R.S. 26.1
               Choose j from R.S. 26.2
               Choose k from R.S. 26.3

* Originally developed by Graham Maxwell.
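As with the earlier examples, the Replacement Scheme of Item Form 26.2 can be expressed as a short routine. The sketch below is illustrative only: the replacement sets themselves are not reproduced in this copy, so the ranges are taken from the "Varying Within Cells" description (integers 1 through 12 for b, c, and i; 0 through 11 for j; 1 through 9 for k), and the function name and return format are assumptions.

```python
import random
from decimal import Decimal

INTEGERS_1_TO_12 = range(1, 13)   # assumed content of R.S. 26.1
INTEGERS_0_TO_11 = range(0, 12)   # assumed content of R.S. 26.2
INTEGERS_1_TO_9 = range(1, 10)    # assumed content of R.S. 26.3


def generate_point(cell, rng):
    """Return (stimulus format, volume b, weight c) for one cell."""
    if cell not in (1, 2, 3, 4):
        raise ValueError("cell must be 1 through 4")
    if cell in (1, 2):
        # Both coordinates are integers; reject if b = c so that
        # the order of the pair matters.
        while True:
            b = rng.choice(INTEGERS_1_TO_12)
            c = rng.choice(INTEGERS_1_TO_12)
            if b != c:
                break
        c = Decimal(c)
    else:
        # Weight is j.k: between two Y-axis grid lines, in tenths.
        b = rng.choice(INTEGERS_1_TO_12)
        j = rng.choice(INTEGERS_0_TO_11)
        k = rng.choice(INTEGERS_1_TO_9)
        c = Decimal(f"{j}.{k}")
    stimulus = "ordered pair" if cell in (1, 3) else "volume-weight chart"
    return stimulus, b, c


rng = random.Random(26)
for cell in (1, 2, 3, 4):
    print(cell, generate_point(cell, rng))
```

Decimal is used for the weight so that a value such as 7.3 prints exactly as it would appear on the stimulus sheet.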


2.7    Flowchart of the Process

Figure 2.7.1 should be helpful since it provides a summary of the steps a criterion-referenced test developer must consider in preparing domain specifications.


Figure 2.7.1.     Steps for preparing domain specifications.

[Flowchart. Recoverable box labels: "Zero in on the Behavior to be Measured"; "Select from Competing Domain Alternatives".]

2.8    Objective Banks

There are presently available a number of commercially prepared sets of objectives that can be utilized by a criterion-referenced test developer in his/her work. The test developer may find, however, that he/she must work further with these objectives, as they are usually stated in behavioral terms and lack sufficient clarity to permit a clear determination of the domain of test items intended for the objectives (see section 2.2). Even so, these sets of objectives may serve as an excellent starting point in the development of domain specifications.

The addresses of several of the better known organizations that distribute objectives sets are given below. These organizations provide complete catalogs of subject areas for which objectives have been prepared, and certain of the organizations provide listings of supplemental services that can be used in conjunction with the objectives sets.

Instructional Objectives Exchange (IOX)
Box 24095
Los Angeles, California 90024

Westinghouse Learning Press Publications
770 Lucerne Drive
P. O. Box 9035
Sunnyvale, California 94086


Other organizations distributing objectives (and/or test items) can be located in the classified ads section of Phi Delta Kappan.

2.9    Preparation of Test Specifications

Any good test requires planning. In this section, we will discuss a series of steps that will aid in the planning process. These steps are involved with the preparation of a set of test specifications. In turn, the test specification stage may be viewed as the initial step in the development of a criterion-referenced test. Thus, we will be discussing a series of steps that will aid in the subsequent development of a criterion-referenced test. According to Tinkelman (1971), whose work we have utilized extensively in preparing this section:

     The essence of initial test planning is establishing
     the test specifications; that is, the sum total of
     the qualities and characteristics that the test
     should possess.

The following list of steps to guide in the development of test specifications was taken from Tinkelman (1971) and adapted to fit a discussion of criterion-referenced tests. We present the steps first, and then comment on them. The reader will find that certain of the steps have been covered in detail in other sections.


Steps in Developing Test Specifications

1.  Define the general purpose and requirements of the test.

2.  Establish the specific scope of the test as expressed by the domain specifications or item forms.

3.  Select appropriate item types.

4.  Determine the appropriate number of test items to be used.

5.  Establish how items are to be assembled in the test.

6.  Prepare item-writing and item-review assignments.


Steps 2 and 3 were discussed in detail in earlier sections of the unit. The other steps will be given greater emphasis in the ensuing discussion.


Step  1:    Define  the  general  purpose 

and  requirements  of  the  test. 


According to Tinkelman (1971), the test developer should try to answer the following questions in order to clarify his/her general purpose for testing:

1.  What specific content areas are to be measured?

2.  Who is to be tested?

3.  How are the test scores to be used?

4.  What are the time limitations on testing?

5.  Will there be a need for equivalent forms?

A number of other possible questions can be asked; the point of the process is to get the test developer to zero in on what the purpose and requirements of his/her test are going to be.


Step 2:    Establish the specific scope of the
           test as expressed by the domain
           specifications or item forms.


A great deal has been written in this unit on the development of domain specifications (sections 2.3 and 2.4) and/or item forms (sections 2.5 and 2.6). The only point to be made here is that, in the overall context of the development of a set of test specifications, the domain specification phase is the second step in the process.


Step 3:     Select appropriate item types.


This step is discussed in detail in section 2.10 and also in section 4.3. According to Tinkelman, the item type or item types to be chosen should be considered in reference to:

1.  the domain specifications or item forms

2.  possible scoring procedures

3.  administrative features

4.  printing requirements.

The list is in order of priority; however, a consideration of all four may be necessary before making a final decision on which item type or types to use. It should be pointed out again that, first and foremost, the item type chosen must be such that the items do indeed "tap" the behavior specified in the domain definition. All other considerations are secondary.


Step 4:    Determine the appropriate number
           of items to be used.


At this point in the planning process, the test developer is trying to get an indication of the number of test items that will be needed. This in turn will have a bearing on the number of items that item writers need to construct. Four areas should be considered in making tentative decisions about the number of items:

1.  The relationship of the number of items to the importance placed upon the domain in the curriculum.

2.  The relationship of the number of items to minimum reliability requirements.

3.  The relationship of the number of items to time limits.

4.  The relationship of the number of items to the item-review mortality rate.

In terms of area number one, it may be the case that certain areas of a curriculum have been stressed more than others in the instructional process. If the test developer plans for the test to cover multiple domains, he/she should then plan, when drawing samples of items from each domain, to sample the most important domains more heavily. Such a decision is situation-specific, and little more can be said in terms of overall guidelines.

In reference to area two, the relationship of the number of items to minimum reliability requirements, guidelines are presently being developed. As discussed in the section of unit 5 on reliability, the Spearman-Brown formula, which relates test length to reliability, is reasonable to use only for norm-referenced tests. Similar relationships need to be developed for two important uses of criterion-referenced test scores: domain score estimation and assignment of examinees to mastery states. The following procedure should be helpful to those in the planning process for determining test length when domain score estimation is the problem of interest. The solution is a conservative one, i.e., test lengths determined by this method will be a little longer than they need to be to obtain the degree of precision required by the test developer. The formula¹ is:


\[
\text{Test Length} = \frac{.25}{(\text{degree of precision})^{2}}
\]

Ask yourself (or interested others): What degree of precision is required of the domain score estimates? Discuss the degree of precision question in the same way you would the standard error of measurement. A primary difference between the two is that domain score estimates are defined on a scale [0, 1].

¹ The formula can be derived from the binomial test model.


Example 

Suppose you felt that an error of ± 10% could be tolerated; then, degree of precision = .10; and, using the equation above, test length = .25 / (.10)² = 25.


There is one other important consideration: the item-review mortality rate. You must try to estimate the percentage of items that are likely to be discarded in the review process. Ask yourself: How experienced are my item writers? A rejection rate in the neighborhood of 20% would not be unusual. This figure may seem especially low. It certainly is by norm-referenced test development standards, but the "standards" for a good criterion-referenced test item, while difficult to meet, do not depend on desirable "statistical properties." Such properties are important for norm-referenced test items, and they are very hard for item writers to predict in advance. Hence, we have a good explanation for the relatively higher rejection rate of items prepared for norm-referenced tests than for criterion-referenced tests.

Continuing the example, if in the judgment of the test developer about 20% of the item pool for an objective is apt to be poor, then we must write about 31 test items. (Solution: Let the number of test items prepared be X. If X - (20% of X) = 25, then X = 31.25, or about 31.) Item writers would need to prepare about 31 test items.
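The solution above generalizes to any target test length and any anticipated rejection rate: solving X - (rate)(X) = items needed gives X = items needed / (1 - rate). A minimal sketch follows; the function name is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def items_to_write(items_needed, rejection_rate):
    """Number of items to prepare so that, after review, about
    items_needed survive. rejection_rate is a proportion, e.g. 0.20."""
    rate = Fraction(str(rejection_rate))
    if not 0 <= rate < 1:
        raise ValueError("rejection_rate must be in [0, 1)")
    return Fraction(items_needed) / (1 - rate)


x = items_to_write(25, 0.20)
print(float(x))  # 31.25
```

The text rounds 31.25 down to "about 31"; a developer who wants the expected number of surviving items to be no less than 25 would round up to 32 instead.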

Two points seem worthy of mention here. One, it is unlikely that fewer than five or six items measuring an objective will produce desired levels of reliability. Two, while no tables or formulas exist to connect test length to reliability (or consistency) of decision-making, this can be studied empirically following the administration of a pool of test items. "Post-hoc" test forms of varying lengths can be constructed and reliability estimates may be calculated, on the assumption that examinees would have responded in the same way had they been presented with the "parallel-forms" rather than a single large pool of test items. By varying the length of the forms and the formation of parallel-forms (i.e., which items are placed in which forms), the relationship between test length and reliability for a specified sample of examinees for a pool of test items measuring a particular domain specification can be studied.
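The post-hoc procedure just described can be sketched in a few lines. The example below assumes that item responses are available as a list of rows (one row per examinee, one 0/1 entry per item in the pool), that mastery is decided by comparing proportion correct with a cutoff, and that consistency is summarized as the proportion of examinees classified the same way on both forms. The cutoff, the agreement index, and all function names are assumptions made for illustration; the text does not prescribe them.

```python
import random


def decision_consistency(responses, form_length, cutoff, rng):
    """Form two non-overlapping 'parallel forms' of form_length items at
    random from the pool and return the proportion of examinees given
    the same mastery decision on both."""
    pool_size = len(responses[0])
    if 2 * form_length > pool_size:
        raise ValueError("pool too small for two forms of this length")
    chosen = rng.sample(range(pool_size), 2 * form_length)
    form_a, form_b = chosen[:form_length], chosen[form_length:]
    same = 0
    for row in responses:
        master_a = sum(row[i] for i in form_a) / form_length >= cutoff
        master_b = sum(row[i] for i in form_b) / form_length >= cutoff
        same += master_a == master_b
    return same / len(responses)


def length_study(responses, lengths, cutoff, replications=200, seed=0):
    """Average consistency over many random form assignments, per length."""
    rng = random.Random(seed)
    return {
        n: sum(decision_consistency(responses, n, cutoff, rng)
               for _ in range(replications)) / replications
        for n in lengths
    }


# Illustrative data only: 50 simulated examinees, pool of 40 items.
data_rng = random.Random(1)
abilities = [data_rng.uniform(0.4, 0.95) for _ in range(50)]
pool = [[int(data_rng.random() < p) for _ in range(40)] for p in abilities]
print(length_study(pool, lengths=[5, 10, 20], cutoff=0.8))
```

Repeating the random assignment of items to forms, as length_study does, corresponds to varying "which items are placed in which forms" for each test length.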

In reference to area three, it really goes without saying that the number of items that can be used is limited by the time available. However, it should be noted that this also depends on the item type or types chosen in step 3. For instance, more true-false questions can be asked in a particular time period than completion questions. We can offer few general guidelines; the decision will depend upon the content area, the students tested, the item type(s) selected, and the total time available.

Finally, the number of items needed is dependent on the item-review mortality rate, that is, the number of items that can be expected to be rejected either because of technical flaws or content validity problems. Clearly, the determination of a figure will be situation-specific.


Step 5:     Establish how the items are to be
            assembled in the test.


The material in section 4.5 is relevant in considering this step of the test specification procedure. Therefore, the reader should refer to section 4.5 for details.

Step 6:    Prepare the item-writing and item-
           review assignments.


The item-writing assignments can be handled in three general ways:

1.  An item-writer can concentrate on developing items for a single or a few domain specifications; or

2.  if the writer has an item type specialty, he/she can concentrate on the same item types across domains; or

3.  if item-writer "staleness" becomes a problem, the item-writer can work on a number of domains.

Of course, many offshoots of these very general guidelines are possible. Further, if the test developer is also the item writer, which is often the case, then the guidelines above are of little use.

2.10  Preparation of Test Items

In preparing criterion-referenced test items, one important point should be kept in mind: Close attention must be given to the domain specifications at all times, thereby ensuring that the test items "tap" behaviors in the domain of behaviors defined by the specification. With this in mind, the materials included in this section should help the reader to choose the relevant type of test item to suit his/her purpose, and then to make sure that proper item writing principles are followed in item preparation.

Fortunately, a considerable amount of literature exists to (1) introduce practitioners to available item formats, (2) help practitioners select the "best" item formats to measure particular objectives, and (3) train practitioners to write "good" test items ("good" in the sense of being technically correct and measuring the intended objective). The interested reader is referred to section 2.12.3 for an excellent selection of references. In this section, several summaries are included:

1.  Types of item formats (a list of item formats for objectively and subjectively scored test items),

2.  Definitions and appropriate item formats for objectives classified into different levels of Bloom's Taxonomy of Educational Objectives, Cognitive Domain,

3.  Some differences between essay and objective tests (a comparison of these two common types of tests on nine dimensions),

4.  Principles of item writing (a list of questions, organized by item format, concerning the quality of test items),

5.  Scoring of objective and essay test items.


1/14/79


Item Formats


Objectively Scored Items

Multiple-Choice
Matching
True-False
Short Answer/Completion


Subjectively Scored Items

Essay
Performance


Definitions and Appropriate Item Types
for Objectives Classified into Different
Levels of Bloom's Taxonomy of Educational
Objectives, Cognitive Domain


Category: Knowledge

Definition of the Ability Involved: Knowledge of specifics,
terminology, specific facts, ways and means of dealing with
specifics, conventions, trends, sequences, classifications and
categories, criteria, methodology, universals and abstractions
in a field, principles and generalizations, and theories and
structures.

Verbs Typically Used to Describe Objectives: define, describe,
identify, recall, recognize, name, state, recite, write, acquire,
label, list

Possible Test Item Formats: Multiple-Choice, Matching, True-False


Category: Comprehension

Definition of the Ability Involved: A type of understanding such
that the person knows what is in a message and can use the
information without connecting it necessarily to other pieces of
information or understanding the fullest implications of the
message.

Verbs Typically Used to Describe Objectives: translate, transform,
give in own words, illustrate, prepare, rephrase, restate,
represent, explain, interpret

Possible Test Item Formats: Multiple-Choice, Matching, True-False


Category: Application

Definition of the Ability Involved: Involves the use of
abstractions (e.g., rules or ideas) in concrete situations.

Verbs Typically Used to Describe Objectives: apply, generalize,
relate, develop, organize, use, transfer, demonstrate, compute,
solve, produce, employ

Possible Test Item Formats: Multiple-Choice, Matching


Category: Analysis

Definition of the Ability Involved: Decoding communication into
the proper elements so as to reveal their relationships.

Verbs Typically Used to Describe Objectives: distinguish, classify,
discriminate, analyze, contrast, deduce, subdivide, identify,
differentiate

Possible Test Item Formats: Short Answer, Completion, Essay


Category: Synthesis

Definition of the Ability Involved: Placing elements together to
form a whole, when the whole was not clear before.

Verbs Typically Used to Describe Objectives: compile, categorize,
create, summarize, arrange, write, tell, modify, specify, produce,
combine, synthesize, organize

Possible Test Item Formats: Short Answer, Completion, Essay


Category: Evaluation

Definition of the Ability Involved: Determining the worth of some
material for a given purpose or use.

Verbs Typically Used to Describe Objectives: judge, assess, decide,
compare, contrast, standardize, appraise, criticize, conclude,
interpret

Possible Test Item Formats: Completion, Essay



1/14/79
(Fourth Draft)


Some Differences Between Essay and Objective Tests*


Essay

1. Student plans his/her own
answer and expresses his/her
own beliefs.

2. The test includes relatively few,
usually general questions, calling
for extended answers. It covers
less of the curriculum, but the
part covered is in depth.

3. Thinking and writing time is
needed.

4. The quality of the test is
determined mainly by the skill
of the person grading the paper.

5. The test is relatively easy to
prepare, but tedious and difficult
to score.

6. Much freedom is given to the student
to express his/her ideas in his/her
own words.

7. The student can bluff.

8. It is less clear to the student
what is expected in an answer.

9. The distribution of test scores
is determined by the person
grading the papers.


Objective

1. Student selects an answer or
provides a short answer.

2. The test contains many specific
questions requiring brief
answers. It covers more of the
curriculum, but in less
depth.

3. Thinking and reading time is
needed.

4. The quality of the test is
determined mainly by the test
constructor.

5. The test is tedious and
difficult to prepare, but
easy to score.

6. The test constructor has freedom
to express his/her own values
and preferences. The student has
freedom to show, by his/her score,
knowledge of test content.

7. The student can guess.

8. It is quite clear what is
expected of the student.

9. The distribution of test scores
is determined by the test
constructor.


*From Hambleton, R. K., and Fitzpatrick, A. Review techniques for
criterion-referenced test items. (In preparation)




Objectively-Scored Item Writing Principles¹
General  Principles 

1. Assess only a single piece of knowledge or skill in a test item.

2. Test item readability should be at a level appropriate for the examinees
being tested.

3. Avoid the use of "trick" test items or test items measuring minor or
insignificant points.

4. Always identify the source of opinions or quotes used in test items.

5. Avoid measuring knowledge or skills in a test item which are extraneous
to those which the test item was written to measure.

6. Remove superfluous words or complications in a test item which will
introduce irrelevant factors into examinee test performance.

7.  Test  items  must  be  written  clearly. 

8.  Test  items  must  be  constructed  in  accord  with  standard  rules  of  punc- 
tuation and  grammar. 

9.  Negatives  should  be  underlined  or  highlighted  in  some  way. 

10.  Avoid  the  use  of  words  which  give  clues  to  correct  answers. 

11.  A  test  item  must  have  one  correct  or  clearly  best  answer. 

12.  Examinees  who  have  the  skill  or  knowledge  measured  by  a  test  item 
must  answer  it  correctly. 

13. Ensure that the correct answers follow a random pattern.

14.  Have  content  and  measurement  specialists  review  test  items  to  eliminate 
ambiguity,  technical  errors,  and  other  item  writing  errors. 

¹We would like to thank Anne Fitzpatrick for assistance in this
section of the Unit.



Writing  Multiple-Choice  Test  Items 


Item  Stem  Content: 

1.  Has  new  material  been  used  in  the  item  if 
it  measures  students'  understanding  or 
their  ability  to  apply  principles? 

2.  Could  the  item  be  better  expressed  as  a 
series  of  true-false  questions? 

3.  Is  the  content  of  the  test  item  reflective 
of  the  domain  specification  the  item  was 
prepared  to  measure? 

4. Does the item stem clearly define a
problem?


Item Stem Structure:

1. In the item, is as much material as possible
included in the stem so that the options
are as short as possible?

2. Have all repetitive words or phrases been
placed in the item stem rather than in the
set of answer choices?


Response  Content: 

1. Is there only one correct or one best answer
to each item?

2. If the "best" answer form is used, are the
distractors clearly less correct than the
"best" answer?

3. If the "correct" answer form is used, are the
distractors of an item clearly incorrect?

4.  Will  all  the  distractors  to  the  item  be 
plausible  to  those  who  do  not  possess  the 
skill  measured  by  the  item? 

5. Does the set of answer choices for the item
contain a vocabulary or reading load which
will act as irrelevant sources of difficulty?


Yes    No    Unsure


Response  Structure: 

1. Is the number of distractors for the item
appropriate to the ages of those being tested?

2. Have possible answers such as "all of the
above" and "none of the above" been avoided
in the item?

3.  Does  the  item  contain  two  or  more  distractors 
which  overlap  or  mean  the  same  thing,  such 
that  an  examinee  could  eliminate  these  dis- 
tractors simultaneously? 

4.  Are  all  possible  answers  to  an  item  similar 
in  type,  concept  or  focus  so  that  they  are 
as  homogeneous  as  possible? 

5.  Are  all  possible  answers  of  the  item  gram- 
matically consistent  with  the  item  stem? 

6.  Do  the  answer  choices  of  the  item  have  the 
same  grammatical  form  so  that  they  are 
parallel? 

7.  Are  the  possible  answers  to  the  item 
similar  in  length  and  complexity? 

8. Are the possible answers to the item
listed on separate lines below the item
stem?

9. Are the possible answers to the item
arranged in a logical order where possible?

10.  Are  letters  used  in  front  of  the  possible 
answers  to  identify  them? 

11. Has "don't know" been used as an answer
choice?




Directions:

1. Do directions to the test clearly
specify whether the correct or whether
the best answer is to be chosen?




Writing Matching Test Items


Item Stem (Premise) Content:

1.  Do  the  matches  to  be  made  in  an  item  all 
reflect  important  aspects  of  the  subject 
material  to  be  tested? 

2.  Are  all  premises  clear  in  meaning? 

3. Are the premises of the item clearly
related to one another so that they
are as homogeneous as possible?

Item  Stem  (Premise)  Structure: 

1.  Are  all  the  premises  of  a  set  similar  in 
grammatical  form? 

2.  Does  any  set  of  premises  have  more  than 
8-10  elements? 

Response Content:

1. Do any of the premises plausibly relate to
a response other than its correct match?

2. Are the responses to the item similar
in type, focus or concept so that they
are as homogeneous as possible?

3.  Can  correct  matches  be  made  by  using 
only  logic  or  a  superficial  understanding 
of  the  subject  material? 


Response  Structure: 

1.  Are  there  more  responses  than  premises? 

2.  Are  the  responses  arranged  in  a 
systematic  order  wherever  possible? 




3. Do short phrases, words or numbers make
up the response list of an item whenever
possible?

4. Do the responses share the same grammatical
form?


Directions:

1.  Are  there  headings  for  the  premise  and 
response  lists  of  an  item? 

2.  Do  the  directions  clearly  specify  the  basis 
on  which  matches  are  to  be  made? 

3. Do the directions clearly state whether a
response can be used more than once?

4. Is the matching exercise presented on
a single page?


Writing True-False Test Items


Item Content:

1.  Can  it  be  said  without  qualification  that 
the  item  is  definitely  true  or  false? 

2. Is only a single idea expressed in the
statement comprising the item?

3. Is any part of the item true, while
another part of that item is false?


Item Structure:

1.  Is  the  item  statement  short? 

2.  Is  the  sentence  structure  of  the  item 
statement  simple? 

3.  Is  each  item  stated  as  concisely  as 
possible? 

4.  Does  the  item  contain  vague  words  like 
"seldom,"  "frequently,"  or  "generally"? 


Response Content:

1. Could a person use simply logic or common
sense to identify the correct answer?

2.  Will  the  wrong  answer  to  an  item  be 
plausible  to  those  who  have  not  mastered 
the  subject  material? 


Directions:

1.  Are  directions  included  which  clearly 
describe  how  examinees  should  answer 
the  items? 




Writing Short Answer/Completion Test Items


Yes    No    Unsure


Item Content:

1.  Does  the  item  pose  an  important  rather  than 
a  trivial  question  about  the  subject  matter? 

2.  Is  the  item  written  so  clearly  that  there 
is  a  single  correct  answer  which  a  good 
student  will  know? 

3.  Is  the  item  written  so  that  a  brief  answer 
is  possible? 

4. Is the meaning of the item made unclear
because of too many blanks in the item?


Item Structure:

1.  Have  response  cues  or  specific  determiners 
such  as  "a"  and  "an"  or  singular  and 
plural  verbs  been  avoided  in  the  item? 




Response  Content: 

1.  Is  the  student  asked  to  provide  only  key 
words,  phrases  or  sentences  in  response 
to  the  item? 

2.  Is  the  precision  desired  in  the  answer  to 
the  item  clearly  indicated? 

3.  If  the  item  requires  a  numerical  answer, 
are  the  units  of  the  answer  specified? 


Response Structure:

1. Are the blanks which the student will fill
in placed near the end of the item?

2. Has ample space to record an answer been
provided in the item?

3.  Are  the  answer  spaces  provided  for 
the  items  all  the  same  length? 




Directions: 

1.  Is  it  clearly  indicated  what  form  the 
answers  to  the  items  should  take? 

2.  Is  it  clearly  stated  whether  spelling 
and  grammatical  errors  will  be  scored? 

3. Are students informed of how their
answers will be scored?




Writing Essay Test Items


Yes    No    Unsure


Item  Content: 

1. Does the item pose a clear task for the
examinee by including a clear specification
of the scope and direction desired
in an answer?

2.  Are  students  asked,  in  the  item,  to  "compare," 
"contrast,"  "give  the  reason  for,"  "explain 
how,"  etc.,  rather  than  simply  state  "what," 
"where,"  "when,"  "who,"  or  "where"? 

3.  Does  the  essay  question  pose  a  new  problem 
to  the  examinees? 

4. Is the item a general or broad question
which could be better expressed by several
more concise questions?


Item  Structure: 

1.  Is  the  question  of  unsuitable  length  or 
complexity  for  the  maturity  levels  of 
the  students? 




Directions: 


1.  Are  students  informed  of  an  appropriate 
amount  of  time  they  should  spend  on  each 
essay? 

2. Are students informed of the number of
points associated with each essay question?

3.  Are  students  informed  of  how  their 
responses  to  the  items  will  be  scored? 




4. If students are permitted to choose which
of several questions to answer, are these
several questions equal in difficulty?

5. Has an "ideal" response to each question
been prepared before test administration?




Item Scoring

Multiple  Choice  Test  Items 

The most commonly used formula for correcting for guessing is:

CS = R - W/(n - 1)

where CS is the score corrected for guessing,
R is the number of correct answers,
W is the number of incorrect answers
(not counting omitted questions), and
n is the number of choices for each item.

Examples

1. For two-choice questions, n = 2:

CS = R - W/(2 - 1) = R - W.

For a 90-question test on which a student had 60 correct answers, 10 wrong
answers, and 20 omits, CS = 60 - 10 = 50.

2. Usually we have five-choice questions, n = 5; then

CS = R - W/(5 - 1) = R - W/4.

For a 90-question test on which a student had 50 correct answers, 30 incorrect
answers, and 10 omits,

CS = 50 - (30/4) = 42.5.
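The formula and the two worked examples can be checked with a few lines of Python. This is only an illustrative sketch: the function name corrected_score and its argument names are hypothetical and do not come from the training package.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Return CS = R - W/(n - 1).

    Omitted items are counted in neither `right` nor `wrong`.
    """
    if n_choices < 2:
        raise ValueError("each item needs at least two answer choices")
    return right - wrong / (n_choices - 1)


# Example 1: two-choice items, 60 right, 10 wrong, 20 omitted.
two_choice = corrected_score(60, 10, 2)

# Example 2: five-choice items, 50 right, 30 wrong, 10 omitted.
five_choice = corrected_score(50, 30, 5)

print(two_choice, five_choice)
```

The two printed values correspond to the corrected scores of 50 and 42.5 obtained by hand in the examples.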




To see how this formula corrects for guessing, suppose a student takes
a five-choice, 90-question test. If the student were to guess at each question,
for each question he would have one chance in five of obtaining the correct
answer. Thus on 90 questions, he would obtain 18 correct answers by guessing
(1/5 x 90 = 18). Hence he would obtain 72 incorrect answers (90 - 18 = 72).
Applying the correction-for-guessing formula:

CS = R - W/4
= 18 - (72/4)
= 0

Now suppose a student is able to answer 50 of the questions on the basis
of knowledge and guesses randomly on the remaining 40 questions. He would
obtain 50 correct answers from knowledge and 8 more by guessing. (If he
guesses on 40 questions, on the average he should get 8 right by chance:
1/5 x 40 = 8.) Thus he has 58 correct answers and 32 incorrect answers, giving
him a corrected score of:

CS = 58 - 32/4
= 58 - 8
= 50

It is clear then that guessing at random will not, on the average, improve
a student's score when the correction-for-guessing formula is applied.
Two suggestions are proposed:


1. Students should be informed of the correction-for-guessing formula
to be used so they can formulate a strategy for writing the test.

2. Use of the correction-for-guessing formula is very important when the
test is speeded (when most people do not finish the test) because it
eliminates a large amount of wild guessing. It is relatively ineffective
when most students have time to answer all of the test questions.
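The expected-value argument above (a student who guesses blindly gains nothing, on the average, once the correction is applied) can be sketched in Python. The function name expected_corrected_score and its arguments are hypothetical, and the sketch assumes that every guess is a blind choice among equally likely options.

```python
from fractions import Fraction


def expected_corrected_score(known: int, guessed: int, n_choices: int) -> Fraction:
    """Expected CS when `known` items are answered from knowledge and
    `guessed` items are answered by blind guessing among `n_choices`."""
    lucky = Fraction(guessed, n_choices)    # expected right answers from guessing
    right = known + lucky                    # expected R
    wrong = guessed - lucky                  # expected W
    return right - wrong / (n_choices - 1)   # CS = R - W/(n - 1)


# Pure guessing on a five-choice, 90-question test.
pure_guessing = expected_corrected_score(0, 90, 5)

# 50 items known, 40 items guessed, five choices per item.
partial_knowledge = expected_corrected_score(50, 40, 5)

print(pure_guessing, partial_knowledge)
```

Under these assumptions the expected corrected score always equals the number of items actually known, which matches the two hand calculations (0 and 50). Exact fractions are used so that the check does not depend on floating-point rounding.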


At this point there seems to be little evidence to recommend
"correction-for-guessing" scoring with criterion-referenced tests.
Generally, there seems to be less guessing on criterion-referenced
testing because of the instructional relevancy of the tests. Also,
since the emphasis in test score interpretation is not on a comparison
of students, there is less pressure on students to achieve high scores.
Finally, when instructional decisions are to be made, examinees far
from a cut-off score will be unaffected by a correction-for-guessing
formula. For those examinees near the cut-off score (which is usually
in the region of 70% to 90%), the amount of guessing will be minimal.
Therefore, there seems to be little value in applying a "correction-
for-guessing." We do see merit, however, in two suggestions:

1. The "don't know" answer should be considered as an
answer choice to reduce the effects of guessing.

2. Adjust cut-off scores upward to reduce the chance
that examinees will "demonstrate" mastery because they
were lucky enough to guess the answers to a few questions.
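The effect of the second suggestion can be illustrated with a small binomial calculation. Everything in this sketch is an assumption made for illustration: the function name prob_lucky_pass, the 10-item test, the four answer choices, and the examinee who truly knows 6 items are hypothetical and are not taken from the training package.

```python
from math import comb


def prob_lucky_pass(known: int, n_items: int, cutoff: int, n_choices: int) -> float:
    """Probability that an examinee who knows `known` items and guesses
    blindly on the rest answers at least `cutoff` items correctly."""
    guessed = n_items - known
    needed = max(cutoff - known, 0)   # lucky guesses required to reach the cut-off
    p = 1 / n_choices                 # chance of guessing a single item correctly
    return sum(
        comb(guessed, k) * p**k * (1 - p) ** (guessed - k)
        for k in range(needed, guessed + 1)
    )


# Hypothetical examinee: knows 6 of 10 four-choice items, guesses on the other 4.
at_cutoff_8 = prob_lucky_pass(known=6, n_items=10, cutoff=8, n_choices=4)
at_cutoff_9 = prob_lucky_pass(known=6, n_items=10, cutoff=9, n_choices=4)

print(round(at_cutoff_8, 3), round(at_cutoff_9, 3))
```

In this hypothetical case, raising the cut-off from 8 to 9 items lowers the chance of a lucky "mastery" classification from roughly 0.26 to roughly 0.05. The price is that true masters who make a careless error are more likely to be misclassified, so the adjustment is a judgment call rather than a fixed rule.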


Scoring 


Yes       No  Unsure 


Short  Answer/Completion  Items 

1.  Has  enough  time  been  set  aside  for  scoring  the  tests? 

2.  For  each  item,  has  the  answer  or  set  of  answers  which 
should  be  considered  correct  been  identified? 

3. Have variations of the correct answer, which might
be considered partially correct, been identified?

4.  Has  the  manner  in  which  correct  answers  will  be 
scored  been  identified? 

5.  Has  the  manner  in  which  partial  credit  will  be 
given  for  a  response  been  identified? 

6. Has a scoring system for each aspect of a response
such as spelling or grammar been specified, if
these qualities are to be assessed?

7.  Has  a  scoring  key  been  prepared,  if  needed? 

8. Will the scoring key be checked against a random
sample of completed tests to make sure that the
key accommodates all interpretations of each question?

9. Will people who have mastery in the subject area
of the test be scoring the tests?

10.  Will  a  complete  set  of  correct  answers  be  provided 
to  each  student  who  takes  the  test? 


Essay  Test  Items 

1.  Have  arrangements  been  made  with  two,  or  preferably 
more,  readers  to  independently  evaluate  each  of  the 
essays? 

2.  Will  the  readers  selected  be  skilled  in  the  content 
areas  to  which  the  essay  questions  relate? 

3. Will each reader be asked to evaluate all responses
to one question at a time?

4.  Will  essay  readers  be  given  enough  time  to  grade 
all  responses  to  a  question  without  interruption? 

5.  Will  all  answers  to  a  question  be  as  anonymous 
as  possible? 


6. Will readers be advised to shuffle all answers
to a question before beginning their evaluations?

7. Has a uniform grading system been established
which will apply to all responses to a question?

a.  Analytical  method: 

1. Has an "ideal" answer to each essay question
been prepared?

2.  Have  the  contents  of  each  ideal  answer  been 
identified  and  listed? 

3. Have other qualities of each "ideal" response
such as logical organization, grammar, support
of statements, etc. been identified and listed?

4.  Has  each  aspect  of  content  and  other  qualities 
been  assigned  score  points? 

5.  Has  a  procedure  been  established  to  indicate 
partial  demonstration  of  a  listed  aspect? 

6.  Will  each  reader  record,  for  each  essay, 
the  presence  or  absence  of  each  listed 
aspect  contained  in  that  essay? 

7.  Will  a  person  other  than  one  of  the  readers 
be  responsible  for  assigning  score  points 
to  the  essays  evaluated  by  the  readers? 

b.  Global  method: 

1.  Have  the  categories  into  which  an  essay  is 
to  be  classified,  in  terms  of  its  overall 
quality,  been  specified? 

2. Have actual (or sample) responses to represent
each of the several categories been identified
(or devised)?

3. Will all essay readers read, rate and discuss
each essay representing a category?

4.  Will  each  essay  be  read  rapidly  and  a  global 
impression  of  its  quality  be  used  to  classify 
it? 

5. Will at least two readers read and classify
each essay in terms of its overall quality?

6.  Will  the  sum  or  the  average  of  the  ratings  of 
an  essay  be  used  as  a  final  score  for  that 
essay? 



2.11    Editing Test Items

At this point in the development process, it is important for the
test constructor to check the test items developed to see if they meet
the basic technical criteria set for items. The focus at this point
should be on the technical quality of the items and the suitability of
the directions to the student about how to respond. At a later
point, other reviewers will be asked to comment on the content validity
of the items.

The item review form presented on the next three pages will be
helpful to individuals interested in conducting a systematic technical
analysis of their items. Two points are worthy of mention. One, our
item review form is specific to multiple-choice test items. Of course,
it will be quite easy for anyone to use our format and principles for
preparing other types of items, and design new item review forms, one
for each item type. Two, section two of the item review form was designed
to be content specific. In this instance, the area was reading/language arts.
In different content areas, it is likely that other relevant questions
would be included in section two. Other times, section two may be deleted.
The item review form was used recently (in a modified form) at an item
writing workshop in the Montgomery County Public School System (MCPS)
(Rockville, Maryland). The workshop was conducted by the two authors
with the excellent assistance of Lois Martin, Kay Morgan, and Liz
Flach from MCPS. We are grateful to them (and many of their colleagues)
for their constructive criticisms and helpful comments.




2/17/78 


Item  Review  Form 
(Multiple-Choice) 


Objective  Number:   Test  Item  Number: 

Reviewer:

Date: 


Objective: 


Test  Item: 


Section I.    Technical Quality

Place a check (✓) under the column corresponding to your rating of the test item for
the questions in this section and the next one.


Yes    Questionable  No 


1.  Is  the  item  stem  clearly  written  for  the 
intended  group  of  examinees? 


2.  Is  the  item  stem  free  of  irrelevant  material?       

3. Is a problem clearly defined in the item
stem?

4. Are the choices clearly written for
the intended group of examinees?

5. Are the choices free of irrelevant
material?



6.  Is  there  a  correct  answer  or  a  clearly 
best  answer? 

7. Have words like "always," "none," or
"all" been removed?

8. Are likely examinee mistakes used to
prepare incorrect answers?

9. Is "all of the above" avoided as a
choice?




10. Are the choices arranged in a logical
sequence (if one exists)?

11. Was the correct answer randomly positioned
among the available choices?

12. Are all repetitious words or expressions
removed from the choices and included
in the item stem?

13.  Are  all  of  the  choices  of  approximately 
the  same  length? 


14.  Do  the  item  stem  and  choices  follow 
standard  rules  of  punctuation  and  grammar? 

15.  Are  all  negatives  underlined? 

16.  Are  grammatical  cues  between  the  item  stem 
and  the  choices,  which  might  give  the 
correct  answer  away,  removed? 

17.  Is  the  item  format  appropriate  for 
measuring  the  intended  objective? 

18.  Does  the  test  item  measure  the  intended 
objective? 

19.  Does  the  test  item  measure  only  the 
intended  objective? 


Section  II.     Technical  Quality  Matters  Specific  to 
Reading/Language  Arts  Test  Items 


1.  Can  a  correct  answer  be  given  without 
reading  the  passage? 

2.  Is  the  discourse  appropriate  for 
measuring  the  intended  objective? 



3. Do the discourse and test item provide
a valid measure of the intended objective?

4.  Do  the  following  fall  within  the  range 
for  the  number  of  words  in  each  sentence 
of  the 

(a)  directions? 

(b)  discourse? 

(c)  item  stem? 

(d)  item  choices? 

5.  Do  the  following  fall  within  the  range  for 
the  number  of  sentences  in  the 

(a)  directions? 

(b)  discourse? 

(c)  item  stem? 

(d)  item  choices? 

6.  Is  there  the  desired  number  of  words 
in  the  selection  of  discourse? 

7.  Does  the  test  item  contain  the  desired 
number  of  choices? 

8.  Is  the  ratio  of  common  to  uncommon 
words  correct? 




Suggested  Revisions: 


Final Rating (Check One):


□ Accept

□ Accept (with revisions, see above)

□ Reject



2.12  References 


The  references  are  divided  into  three  sections:    References  Cited, 
References  for  Further  Study,  and  Measurement  and  Evaluation  Textbooks. 


2.12.1    References Cited

Allendoerfer, C. B. The utility of behavioral objectives: A valuable
aid to teaching. Mathematics Teacher, December 1971, 686, 738-742.

Anderson, R. C. How to construct achievement tests to assess comprehension.
Review of Educational Research, 1972, 42, 145-170.

Berk, R. A. The application of structural facet theory to achievement
test construction. Educational Research Quarterly, 1978, 3,
in press.

Bormuth,  J.  R.    On  the  theory  of  achievement  test  items.  Chicago: 
University  of  Chicago  Press,  1970. 

Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.), Educational
measurement. (2nd ed.) Washington: American Council on Education,
1971.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability
of behavioral measurements: Theory of generalizability
for scores and profiles. New York: John Wiley & Sons, 1972.

Duchastel, P. E., & Merrill, P. F. The effects of behavioral objectives
on learning: A review of empirical studies. Review of Educational
Research, 1973, 43, 53-69.

Ebel, R. L. Evaluation and educational objectives. Journal of Educational
Measurement, 1973, 10, 273-279.

Forbes, J. E. The utility of behavioral objectives: A source of dangers
and difficulties. Mathematics Teacher, December 1971, 687, 744-747.

Gagne,  R.  M.    Behavioral  objectives?    Yes!    Educational  Leadership,  February 
1972,  394-396. 

Hambleton,  R.  K.    Applications  of  latent  trait  models  to  the 
development  and  uses  of  criterion-referenced  tests. 
Laboratory  of  Psychometric  and  Evaluative  Research  Report 
No.  91.    Amherst,  MA:    School  of  Education,  University  of 
Massachusetts,  1979. 




Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S. Domain-
referenced curriculum evaluation: A technical handbook and a
case study from the Minnemast Project. CSE monograph series in
evaluation, No. 1. Los Angeles: Center for the Study of Evaluation,
University of California, 1973.

Hively, W., Patterson, H. L., & Page, S. A. A "universe-defined" system
of arithmetic achievement tests. Journal of Educational Measurement,
1968, 5, 275-290.

Kneller,  G.  F.    Behavioral  objectives?    No!    Educational  Leadership, 
February  1972,  397-400. 

MacDonald, J. B., & Wolfson, B. J. A case against behavioral objectives.
The Elementary School Journal, December 1970, 119-128.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley,
California: McCutchan Publishing Co., 1974.

Millman, J. Hang the hang-ups about test making. A paper presented
at the First Annual Johns Hopkins University National Symposium
on Educational Research, "Criterion-Referenced Measurement:
The State of the Art," Washington, D.C., October 27, 1978.

Osburn,  H.  G.     Item  sampling  for  achievement  testing.     Educational  and 
Psychological  Measurement,  1968,  28,  95-104. 

Popham, W. J. Probing the validity of arguments against behavioral goals.
A symposium presentation at AERA, Chicago, Illinois, 1968.

Popham, W. J. An approaching peril: Cloud-referenced tests. Phi Delta
Kappan, 1974, 56, 614-615.

Popham, W. J. Educational evaluation. Englewood Cliffs, N.J.: Prentice-
Hall, 1975.

Popham,  W.  J.     Criterion-referenced  measurement.     Englewood  Cliffs,  N.J.: 
Prentice-Hall,  1978.  (a) 

Popham,  W.  J.    A  lasso  for  runaway  test  items.    A  paper  presented  at 
the  First  Annual  Johns  Hopkins  University  National  Symposium 
on  Educational  Research,  "Criterion-Referenced  Measurement: 
The  State  of  the  Art,"  Washington,  D.C.,  October,  1978.  (b) 

Scandura, J. M. Problem-solving: A structural/process approach with
educational implications. New York: Academic Press, 1977.

Tinkelman, S. N. Planning the objective test. In R. L. Thorndike (Ed.),
Educational measurement. (2nd ed.) Washington: American Council
on Education, 1971.

Traub, R. E. Stirring muddy water: Another perspective on criterion-
referenced measurement. Ontario Institute for Studies in Education,
mimeo, 1975.




2.12.2    References  for  Further  Study 

Baker, E. L. Beyond objectives: Domain-referenced tests for evaluation
and instructional improvement. Educational Technology, 1974,
14, 10-16.

Coffman, W. E. Essay examinations. In Thorndike, R. L. (Ed.)
Educational Measurement. (2nd edition) Washington, D.C.:
American Council on Education, 1971.

Jackson,  R.    Developing  criterion-referenced  tests.    TM  Report  No.  1. 
Princeton,  N.J.:    ERIC  Clearing  House  on  Tests,  Measurement 
and  Evaluation,  1970. 

Nitko, A. J. Problems in the development of criterion-referenced tests:
The IPI Pittsburgh experience. In C. W. Harris, M. C. Alkin, and
W. J. Popham (Eds.), Problems in criterion-referenced measurement.
CSE monograph series in evaluation, No. 3. Los Angeles: Center
for the Study of Evaluation, University of California, 1974.

Popham,  W.  J.  (Ed.),  Criterion-referenced  measurement:    An  introduction. 
Englewood  Cliffs,  N.J.:    Educational  Technology  Publications, 
1971. 

Posner, G. J., & Strike, K. A. A categorization scheme for principles
of sequencing content. Review of Educational Research, 1976,
46, 665-690.

Roid, G. H., & Haladyna, T. M. A comparison of objective-based and
modified-Bormuth item writing techniques. Educational and
Psychological Measurement, 1978, 38, 19-28.

Shoemaker, D. M. Toward a framework for achievement testing. Review
of Educational Research, 1975, 45, 127-147.

Wesman, A. G. Writing the test item. In Thorndike, R. L. (Ed.)
Educational Measurement. (2nd edition) Washington, D.C.:
American Council on Education, 1971.



2.12.3    Measurement and Evaluation Textbooks


Ahmann, J.S., and Glock, M.D. Evaluating pupil growth. (5th ed.)
Boston: Allyn and Bacon, 1975.

Anastasi, A. Psychological testing. (4th ed.) New York: Macmillan,
1976.

Bloom, B.S., Hastings, J.T., & Madaus, G.F. Handbook on formative and
summative evaluation of student learning. New York: McGraw-Hill,
1971.

Chase, C.I. Measurement for educational evaluation. Reading, MA:
Addison-Wesley, 1974.

Cronbach, L.J. Essentials of psychological testing. (3rd ed.) New
York: Harper and Row, 1970.

Ebel, R.L. Essentials of educational measurement. Englewood Cliffs,
NJ: Prentice-Hall, 1972.

Gronlund, N.E. Measurement and evaluation in teaching. (3rd ed.)
New York: Macmillan, 1976.

Hills, J.R. Measurement and evaluation in the classroom. Columbus,
OH: Charles E. Merrill, 1976.

Lemke, E., and Wiersma, W. Principles of psychological measurement.
Chicago: Rand McNally, 1976.

Lewis, D.G. Assessment in education. New York: Wiley, 1975.

Lien, A.J. Measurement and evaluation of learning. (3rd ed.) Dubuque,
IA: Wm. C. Brown Company, 1976.

Lindvall, C.M., and Nitko, A.J. Measuring pupil achievement and aptitude.
(2nd ed.) New York: Harcourt Brace Jovanovich, 1975.

Lyman, H.B. Test scores and what they mean. Englewood Cliffs, NJ:
Prentice-Hall, 1971.

Mehrens, W.A., and Lehmann, I.J. Measurement and evaluation in education
and psychology. New York: Holt, Rinehart and Winston,
1973.

Payne, D.A. The assessment of learning: Cognitive and affective.
Lexington, MA: D.C. Heath, 1974.

Sax, G. Principles of educational measurement and evaluation.
Belmont, CA: Wadsworth, 1974.

Stanley, J.C., and Hopkins, K.D. Educational and psychological measurement
and evaluation. Englewood Cliffs, NJ: Prentice-Hall,
1972.

Thorndike, R.L., & Hagen, E.P. Measurement and evaluation in psychology
and education. (4th ed.) New York: Wiley, 1977.


Additional  Measurement  and  Evaluation  Textbooks 

Blood, D. F., & Budd, W. C. Educational measurement and evaluation. New
York: Harper and Row, 1972.

Dick, W., & Hagerty, N. Topics in measurement: Reliability and validity.
New York: McGraw-Hill, 1971.

Gronlund, N. E. Constructing achievement tests. (2nd ed.) Englewood
Cliffs, N.J.: Prentice-Hall, 1977.

Hannah, L. S., & Michaelis, J. U. A comprehensive framework for instructional
objectives: A guide to systematic planning and evaluation.
Reading, MA: Addison-Wesley, 1977.

Kibler, R. J., Barker, L. L., & Miles, D. T. Behavioral objectives.
Boston: Allyn & Bacon, 1970.

Mager, R. F., & Pipe, P. Analyzing performance problems. Belmont, CA:
Fearon Publishers, 1970.

Nelson, C. H. Measurement and evaluation in the classroom. Toronto: The
Macmillan Company, 1970.

Townsend, E. A., & Burke, P. J. Using statistics in classroom instruction.
New York: Macmillan, 1975.


Martuza, V. P. Applying norm-referenced and criterion-referenced measurement
in education. Boston: Allyn & Bacon, 1977.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, NJ:
Prentice-Hall, 1978.




Unit  3 

Assessment  of  Content  Validity 


Prepared  By 


Ronald K. Hambleton
University of Massachusetts, Amherst

and

Daniel R. Eignor
Educational  Testing  Service 


March  15,  1979 




Table of Contents

3.0  Overview of the Unit

3.1  Introduction

3.2  Judgments of Content Specialists

     3.2.1  Item-Objective Match

            A.  An Index of Item Homogeneity
            B.  Semantic Differential Technique
            C.  The Matching Procedure
            D.  Field Test of the Three Procedures
            E.  Item Review Form
            F.  Summary

     3.2.2  Representativeness of the Test Items

3.3  Collection and Analysis of Student Response Data

     3.3.1  Standard Item Indices
     3.3.2  Item Change Statistic
     3.3.3  Items as Measures of Single Objectives

3.4  Additional Editing of the Test Items

3.5  References

     3.5.1  References Cited
     3.5.2  Additional References



3.0    Overview of the Unit

This unit was prepared to introduce practitioners to methods for
determining the content validity of a set of test items. Principally
there are two methods: involvement of content specialists and the
collection and analysis of student response data. In a final section,
the matter of item revisions based on available data is considered.



3.1  Introduction 

Generally  speaking,  the  quality  of  criterion-referenced  test  items 
can  be  determined  by  the  extent  to  which  they  reflect,  in  terms  of  their 
content, the domains from which they were derived. The problem here is
one of item validation; unless one can say with a high degree of confidence
that the items in a criterion-referenced test measure the intended
instructional objectives, any use of the test score information
will be questionable. Thus far, the possible use of item generation
forms, amplified objectives, and domain specifications has been
considered. When item generation rules are used, a high degree
of  confidence  in  terms  of  items  measuring    intended  objectives  is 
derived  through  the  direct  relationship  set  up  between    items    and  the 
domain.    This  might  be  called  an  a  priori  approach  to  item  validity; 
the  approach  itself  assures  that  the  items  are  valid,  or  representative 
of  the  domains.  When  amplified  objectives  or  domain  specifications  are 
utilized,  the  domain  definitions  are  never  really  precise  enough  to  assume, 
a  priori,  that  the  items  are  valid.    Thus,  the  quality  of  the  items,  in  a 
context  independent  from  the  process  by  which  the  items  were  generated, 
must  be  determined.    This  is  an  a  posteriori  approach  to  item  validation, 
and  the  procedures  to  be  discussed  are  designed  to  assess  whether  or  not 
a  direct  relationship  between  an  item  and  a  domain  or  objective  exists 
through  analysis  of  data  collected  after  items  are  written. 

There  are  two  general  approaches  that  may  be  used  to  establish  the 
(content)  validity  of  criterion-referenced  test  items.     The  first  approach, 
and the approach we feel holds the most merit, involves the judgments of
test items by content specialists. The judgments that are made concern
the extent of "match" between the test items and the domains they are
designed to measure. Questions asked of content specialists about
content validity of test items can be reduced to two important ones
(Hambleton, 1978):

1.  Are the format and content of an item appropriate to measure
some part of the domain specification?

2.  Do the items adequately sample a particular domain?

The  second  approach  is  to  apply  empirical  techniques,  in  much  the 
same  way  as  empirical  techniques  are  applied  in  norm-referenced  test 
development.    In  fact,  along  with  some  recently  developed  empirical 
procedures  for  criterion-referenced  tests,  several  norm-referenced  test 
item  statistics  can  (and  should)  be  used.    The  problem  is  to  ensure  that 
these  statistics  are  used  and  interpreted  correctly  in  the  context  of 
criterion-referenced  test  development.    There  are  at  least  four  problems 
involved  with  the  use  of  empirical  procedures.    These  problems  are: 

1.  Most  (if  not  all)  of  the  procedures  are  dependent  upon  the 
characteristics  of  the  group  of  examinees  and  the  effects 
of  instruction. 

2.  They often require sophisticated techniques and/or computer
programs which are not available to practitioners.

3.  When  item  statistics  derived  from  empirical  analyses  of  test 
data  are  used  to  "select"  the  items  for  a  criterion-referenced 
test,  the  test  developer  runs  the  risk  of  obtaining  a  non- 
representative  set  of  items  from  the  domains  measuring  the 
objectives  included  in  the  test. 

4.  Empirical methods in many instances require pretest and posttest
data on the same items. Pretest data are rarely collected,
and often they cannot be. One reason is that there is a reluctance to
administer tests to examinees where there is little chance of
moderate or high levels of performance.

Several criterion-referenced test theorists do espouse the use of
empirical procedures for validating test items. However, one point
to be made in this unit is that empirical procedures are less useful
than  the  ratings  of  content  specialists  in  the  item  validation  process. 
It  is  often  argued  that  many  item  statistics  will  be  low  (item 
discrimination  indices,  for  example)  because  test  score  variance 
will  be  low,  and  therefore  they  will  be  of  limited  usefulness.  On 
the  contrary,  authors  such  as  Haladyna  (1974)  have  found  that  there 
is  usually  sufficient  test  score  variance.    Also,  test  score  variance 
can  be  assured  by  the  selection  of  a  proper  item  pilot  study  sample. 
Both  "masters"  and  "non-masters"  of  the  content  under  study  should  be 
located and included in a pilot sample. The fact is that empirical
data is not very useful for answering the two content validity questions
introduced earlier, and therefore empirical methods have limited usefulness.
On the other hand, when construct validity evidence is being
sought,  examinee  response  data  is  exactly  what  is  needed.  However, 
empirical  methods  do  have  one  important  use  in  the  content  validation 
process.    According  to  Rovinelli  and  Hambleton  (1977): 

In  situations  where  a  large  sample  of  examinees  is  avail- 
able and  where  the  test  constructor  is  interested  in 
identifying  aberrant  items,  not  for  elimination  from  the 
item  pool  but  for  correction,  the  use  of  an  empirical 
approach  to  item  validation  should  provide  important 
information  with  regard  to  the  assessment  of  item 
validity. 

In  sum,   the  use  of  content  specialists'  ratings  is  the  method  to 
use  for  content  validating  test  items;  empirical  procedures  should  be 
used  only  for  the  detection  of  aberrant  items  in  need  of  correction. 
Unlike empirical procedures, the use of content specialists' ratings is
not dependent on examinee group composition or instructional effects,
may not require sophisticated statistical techniques, is not restricted
to highly structured content domains, and finally, can be implemented
easily in practical settings.



3.2    Judgments  of  Content  Specialists 

Content specialists must address two questions in assessing content
validity:

1.  Are the format and content of an item appropriate to
measure some part of the domain specification?

2.  Do the items adequately sample a particular domain?

Methods are offered in Sections 3.2.1 and 3.2.2 for addressing each
question.

3.2.1    Item-Objective Match

A.    An  Index  of  Item  Homogeneity 

This  technique  is  based  upon  the  original  work  of  Hemphill  and 
Westie  (1950)  in  constructing  personality  tests.    The  mechanism  for 
collecting  data  consists  of  having  content  specialists  rate  each  item 
on  each  of  a  set  of  objectives  by  assigning  a  value  of  +1,  0,  or  -1. 
These  three  possible  ratings  have  the  following  meanings: 

+1 = definite feeling that an item is a measure of the objective
 0 = undecided about whether the item is a measure of the objective
-1 = definite feeling that an item is not a measure of the objective.

Basically,  the  content  specialist's  task  is  to  make  a  judgment  about 
whether  or  not  an  item  is  reflective  of  the  content  defined  by  a  domain 
specification.    If,  for  example,  there  are  10  objectives  and  30  test 
items,  each  content  specialist  is  required  to  make  300  judgments. 

Rovinelli and Hambleton (1977) extended the work of Hemphill and
Westie by developing a new statistic for providing a numerical
representation of the data. They called this new statistic the
Index of Item-Objective Congruence. The assumptions under which this
index was developed are:

1.  That  perfect  item  objective  congruence  should  be  represented  by 
a  value  of  +1  and  will  occur  when  all  the  specialists  assign  a 
+1  to  the  item  for  the  appropriate  objective  and  a  -1  to  the 
item  for  all  the  other  objectives. 

2.  That  the  worst  value  of  the  index  an  item  can  receive  should  be 
represented  by  a  value  of  -1  and  will  occur  when  all  the  special- 
ists assign  a  -1  to  the  item  for  the  appropriate  objective  and 

a  +1  to  the  item  for  all  the  other  objectives. 

3.  That  the  value  of  the  index  should  not  depend  on  the  number  of 
content  specialists  or  the  number  of  objectives. 

The  index  of  item-objective  congruence  is  given  by 

$$
I_{ik} = \frac{(N-1)\sum_{j=1}^{n} X_{ijk} \;-\; \sum_{i=1}^{N}\sum_{j=1}^{n} X_{ijk} \;+\; \sum_{j=1}^{n} X_{ijk}}{2(N-1)n}
$$

where

$I_{ik}$   is the index of item-objective congruence for item k on objective i,

$N$        is the number of objectives (i = 1, 2, ..., N),

$n$        is the number of content specialists (j = 1, 2, ..., n),

$X_{ijk}$  is the rating (-1, 0, +1) of item k as a measure of objective
           i by content specialist j.

The choice of a cut-off score to separate "valid" from "non-valid"
items with the index should be based on experience with
content specialists' ratings and with the index itself. In our work,
when we feel it desirable to set a cutting score, we create the poorest
set of content specialists' ratings that we would be willing to accept as
evidence for the content validity of a test item. The
value of the index for this set of minimally acceptable ratings serves as the
cutting score for judging the item-objective match for each of the test items.
For example, suppose that we have 20 content specialists and 10 objectives.
We might desire that at least 15 of the content specialists match the item
to the intended objective and that they indicate that the item is not a
measure of the other nine objectives. In this case:

$$
\begin{aligned}
I_{ik} &= \frac{9(15) - [(-9)(15) + (+1)(15)] + (15)}{2(9)(20)} \\
       &= \frac{135 - [-135 + 15] + 15}{360} \\
       &= \frac{135 + 120 + 15}{360} \\
       &= \frac{270}{360} \\
       &= .75
\end{aligned}
$$

Note that for this example, N = 10 and n = 20. The ratings of the five
specialists not counted among the 15 are taken to be 0 on every objective,
so they contribute nothing to the numerator. The middle term in the numerator
indicates how the judges (we want at least 15 of them to match the item to
the intended objective and indicate lack of match to the other nine objectives)
scored the item on all ten objectives. That is, the 15 gave a score of -1
on nine objectives and a score of +1 on the other, the intended objective.
The final term corrects the bias built into the middle term by adding back
into the numerator the scores subtracted out in the middle term for the
intended objective. The value, .75, serves as the criterion against which
item validities from the content specialists' ratings are judged.
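
The cutting-score calculation above can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' own program: the function name `congruence_index` and the data layout (one list of ratings per specialist) are assumptions made for illustration, and the five specialists outside the minimally acceptable group are assumed to be undecided (0) on every objective, as the arithmetic of the example implies.

```python
from typing import Sequence


def congruence_index(ratings: Sequence[Sequence[int]], target: int) -> float:
    """Index of item-objective congruence for one item.

    ratings[j][i] is specialist j's rating (-1, 0, or +1) of the item
    on objective i; target is the position of the intended objective.
    """
    n = len(ratings)          # number of content specialists
    N = len(ratings[0])       # number of objectives
    target_sum = sum(r[target] for r in ratings)   # sum over specialists, intended objective
    all_sum = sum(sum(r) for r in ratings)         # sum over specialists and all objectives
    return ((N - 1) * target_sum - all_sum + target_sum) / (2 * (N - 1) * n)


# Minimally acceptable pattern from the example: 15 of 20 specialists give
# +1 on the intended objective (position 0) and -1 on the other nine.
match = [1] + [-1] * 9
undecided = [0] * 10          # assumed ratings of the remaining five specialists
minimal_ratings = [match] * 15 + [undecided] * 5

cutoff = congruence_index(minimal_ratings, target=0)
print(cutoff)  # 0.75
```

An item whose observed index falls below `cutoff` would then be flagged for review against its intended objective.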




The  one  major  drawback  of  the  approach  is  that  it  is  very  time 
consuming.    Even  if  content  specialists  are  assigned  only  a  portion 
of  the  domain  specifications  and  test  items  to  review,  the  time  required 
to  rate  the  quality  of  each  of  a  set  of  test  items  against  all  other 
domain specifications presented in a set can be substantial. Still,
the approach is especially useful if there is reason to believe that
test items may be measuring several objectives simultaneously.

B.    Semantic Differential Technique

This technique employs the use of the semantic differential procedure
(Osgood, Suci, and Tannenbaum, 1957). Content specialists are presented
with an objective and all the items on which ratings are desired. They are
asked to make a judgment which consists of deciding whether the item-objective
relationship is best described by the adjective toward the left
end or the right end of the scale.

The  following  is  an  example  consisting  of  one  objective,  one  item, 
and  two  adjective  scales,  along  with  a  set  of  typical  directions: 

Objective:  Given the chemical formula for a molecule, determine the
            number of atoms in a molecule.

Item 1:  How many atoms are there in a molecule of sulfuric acid, H2SO4?


Directions

Given the objective and item above, your task is to make a judgment on
the relationship between the objective and the item on the adjective
scales indicated below.


   very                        no                          very
 relevant     relevant      feeling      irrelevant     irrelevant

   very                        no                          very
 suitable     suitable      feeling      unsuitable     unsuitable



The  data  obtained  from  the  use  of  this  technique  (more  adjective 
scales  would  be  desirable)  can  be  analyzed  without  employing  elaborate 
statistical  techniques.    Therefore,  it  can  easily  be  used  in  practical 
settings.    The  information  which  is  needed  is  the  average  scale  score  for  each 
item  on  each  objective  rated  by  the  content  specialists.    However,  the  data 
also  lends  itself  to  more  elaborate  statistical  analysis.    An  examination 
of  the  standard  deviations  of  the  ratings  given  each  item  on  each  of  the 
scales  will  provide  an  indication  of  the  extent  of  agreement  among  the 
content  specialists.    An  average  across  items  and  scales  will  give  a  general 
indication of the extent of the specialists' agreement. For instance, in a
study done by Rovinelli and Hambleton where there were 48 items and a 5 point
rating scale, the average standard deviation was .46. On the level of the
item, with the exception of a few items, the standard deviations were
quite small, thus indicating substantial agreement among the content
specialists' ratings.
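
The analysis just described amounts to a mean and a standard deviation for each item on each adjective scale, plus an overall average of the standard deviations. The sketch below illustrates this with invented ratings. The numeric coding (5 = "very relevant" down to 1 = "very irrelevant"), the item and scale labels, and the use of the sample standard deviation are all assumptions; the source does not state which coding or which form of the standard deviation was used.

```python
from statistics import mean, stdev

# Hypothetical ratings by five content specialists, coded 1-5
# (5 = "very relevant"/"very suitable", 1 = "very irrelevant"/"very unsuitable").
ratings = {
    ("item 1", "relevant"): [5, 5, 4, 5, 4],
    ("item 1", "suitable"): [4, 5, 4, 4, 4],
    ("item 2", "relevant"): [2, 5, 1, 4, 3],
    ("item 2", "suitable"): [3, 4, 1, 5, 2],
}


def scale_summary(data):
    """Return {(item, scale): (average scale score, standard deviation)}."""
    return {key: (mean(values), stdev(values)) for key, values in data.items()}


summary = scale_summary(ratings)

# General indication of agreement: average standard deviation over items and scales.
average_sd = mean(sd for _, sd in summary.values())

for (item, scale), (avg, sd) in summary.items():
    print(f"{item:7s} {scale:9s} mean={avg:.2f} sd={sd:.2f}")
print(f"average sd = {average_sd:.2f}")
```

In this invented data set the specialists agree closely on item 1 and disagree on item 2, which shows up as a much larger standard deviation for item 2.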

A  second  procedure  for  assessing  item-objective  match  involving 
the use of a rating scale was offered by Hambleton (1978). In this procedure,
content specialists are given objectives (or domain specifications)
and  a  set  of  test  items.    Their  task  is  to  rate  the  quality  of  test  items 
as  measures  of  the  intended  objectives  (or  domain  specifications).  A 
copy  of  a  judge's  rating  form  is  presented  in  Figure  3.2.1. 

Again, the rating scale data may be analyzed without employing any
elaborate statistical procedures. The procedure can easily be used in
practical settings such as in the classroom by teachers. The information
needed is the mean and median rating assigned by a group of content
specialists to the items. An examination of the range of the ratings
given each item provides an indication of the extent of agreement among
the content specialists.


                          Item Rating Form

Reviewer:

Date:

Content Area:

First, read carefully through the lists of domain specifications and test
items. Next, please indicate how well you feel each item reflects the
domain specification it was written to measure. Please use the five-point
rating scale shown below:

        Poor      Fair      Good      Very Good      Excellent
         1         2         3            4              5

Circle the number corresponding to your rating beside the test item number.

Objective     Test Item     Item Rating          Comments

    1             2         1  2  3  4  5
                  7         1  2  3  4  5
                 14         1  2  3  4  5

    2             1         1  2  3  4  5
                  3         1  2  3  4  5
                  8         1  2  3  4  5
                 13         1  2  3  4  5

    3             4         1  2  3  4  5
                  6         1  2  3  4  5
                 12         1  2  3  4  5

    4             5         1  2  3  4  5
                  9         1  2  3  4  5
                 10         1  2  3  4  5
                 11         1  2  3  4  5


Figure 3.2.1    An example of a judge's item rating form.

It is also possible to determine the "closeness" of each judge's
ratings to the median responses of the group. In some cases, when one
or more of the judges are "far out-of-line," it may be best to eliminate
their responses and recalculate the statistics. A summary and analysis
of the hypothetical ratings of nine judges to 14 test items measuring
four objectives is shown in Table 3.2.1.
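
The summary statistics and the "closeness" measure can be computed as follows. This is a sketch under stated assumptions: the ratings are the hypothetical values of Table 3.2.1 as best they can be read from this copy (the rating of judge 3 on item 11 is taken as 5, the value implied by the table's own mean and discrepancy entries); the range is taken as highest minus lowest rating plus one, and a judge's discrepancy as the sum of absolute differences from the item medians, because those definitions reproduce the legible table entries. The function names are illustrative only.

```python
from statistics import mean, median

# RATINGS[item] = ratings given by judges 1..9 (hypothetical data, Table 3.2.1).
RATINGS = {
    2:  [4, 3, 5, 5, 4, 5, 5, 5, 4],
    7:  [4, 2, 5, 5, 5, 5, 5, 4, 5],
    14: [4, 5, 5, 5, 4, 5, 5, 5, 5],
    1:  [3, 5, 3, 2, 1, 4, 5, 2, 4],
    3:  [3, 1, 4, 4, 3, 4, 4, 3, 3],
    8:  [1, 3, 1, 2, 1, 1, 1, 1, 1],
    13: [1, 3, 2, 1, 1, 2, 1, 2, 3],
    4:  [4, 5, 5, 4, 5, 5, 5, 5, 5],
    6:  [4, 2, 4, 4, 4, 4, 4, 4, 4],
    12: [5, 3, 5, 5, 5, 5, 5, 5, 5],
    5:  [4, 3, 5, 5, 4, 5, 5, 4, 5],
    9:  [2, 2, 4, 1, 4, 2, 4, 4, 4],
    10: [1, 3, 1, 2, 1, 1, 1, 1, 1],
    11: [4, 3, 5, 4, 5, 5, 5, 5, 5],  # judge 3 read as 5 (assumption, see lead-in)
}


def item_summary(ratings):
    """Return {item: (mean to one decimal, median, range)} for each item."""
    return {
        item: (round(mean(r), 1), median(r), max(r) - min(r) + 1)
        for item, r in ratings.items()
    }


def judge_discrepancies(ratings):
    """Sum, for each judge, of absolute differences from the item medians."""
    medians = {item: median(r) for item, r in ratings.items()}
    n_judges = len(next(iter(ratings.values())))
    return [
        sum(abs(r[j] - medians[item]) for item, r in ratings.items())
        for j in range(n_judges)
    ]


summary = item_summary(RATINGS)
discrepancies = judge_discrepancies(RATINGS)
print(summary[2])       # mean, median, range for item 2
print(discrepancies)    # one total per judge; large totals mark "out-of-line" judges
```

A judge whose total is far larger than the others (judge 2 in these data) is a candidate for removal before the statistics are recalculated.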

C.    The  Matching  Procedure 

A third procedure which can be used to obtain the judgments of content
specialists  involves  the  use  of  a  matching  task.    Content  specialists  are 
presented  with  two  lists:    One  with  test  items  and  another  with  objectives 
(or  domain  specifications).    The  specialist's  task  is  to  indicate  which  objec- 
tive he/she  thinks  each  test  item  measures,  if  any.    A  contingency  table  is 
then  constructed  by  calculating  the  numbers  of  content  specialists  matching 
each  item  to  each  objective  in  the  sets  of  items  and  objectives  being  studied. 
The  chi-square  test  for  independence  can  then  be  used  to  analyze  the  data 
which  is  presented  in  the  contingency  table.    Also,  a  simple  visual  analysis 
of  the  contingency  table  will  reveal  the  amount  of  agreement  among  the 
specialists,  and  the  types  and  location  of  disagreements.    An  example  of  a 
judge's  set  of  directions  for  matching  test  items  and  objectives  is 
presented in Figure 3.2.2. Some hypothetical results are reported in
Table 3.2.2.
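
The contingency table and the chi-square statistic for the matching task can be built as sketched below. Everything in the example is hypothetical: the four judges, the three items, the two objectives "A" and "B", and the function names are invented for illustration, and the statistic is the ordinary Pearson chi-square computed by hand so that no outside library is needed. With so few judges the expected counts are small, so the value should be read as descriptive rather than as a formal significance test.

```python
from collections import Counter

# Hypothetical matches: each judge maps every item to an objective, or to None
# when the judge feels the item measures none of the objectives.
matches = [
    {1: "A", 2: "A", 3: "B"},
    {1: "A", 2: "B", 3: "B"},
    {1: "A", 2: "A", 3: "B"},
    {1: "A", 2: "A", 3: None},
]


def contingency_table(judge_matches, items, objectives):
    """Rows are items, columns are objectives, cells count matching judges."""
    counts = Counter(
        (item, obj)
        for judge in judge_matches
        for item, obj in judge.items()
        if obj is not None
    )
    return [[counts[(item, obj)] for obj in objectives] for item in items]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


table = contingency_table(matches, items=[1, 2, 3], objectives=["A", "B"])
stat, dof = chi_square(table)
print(table)   # [[4, 0], [3, 1], [0, 3]]
print(round(stat, 2), dof)
```

A visual reading of `table` already shows the pattern the text describes: full agreement on item 1, one dissenting judge on item 2, and one "no match" on item 3.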




                                Table 3.2.1

                  Summary and Analysis of Judges' Ratings
                             of 14 Test Items

                               Judges' Ratings                Summary Statistics
Objective  Test Item    1   2   3   4   5   6   7   8   9    Mean  Median  Range

    1          2        4   3   5   5   4   5   5   5   4     4.4     5      3
               7        4   2   5   5   5   5   5   4   5     4.4     5      4
              14        4   5   5   5   4   5   5   5   5     4.8     5      2

    2          1        3   5   3   2   1   4   5   2   4     3.2     3      5
               3        3   1   4   4   3   4   4   3   3     3.2     3      4
               8        1   3   1   2   1   1   1   1   1     1.3     1      3
              13        1   3   2   1   1   2   1   2   3     1.8     2      3

    3          4        4   5   5   4   5   5   5   5   5     4.8     5      2
               6        4   2   4   4   4   4   4   4   4     3.8     4      3
              12        5   3   5   5   5   5   5   5   5     4.8     5      3

    4          5        4   3   5   5   4   5   5   4   5     4.4     5      3
               9        2   2   4   1   4   2   4   4   4     3.0     4      4
              10        1   3   1   2   1   1   1   1   1     1.3     1      3
              11        4   3   5   4   5   5   5   5   5     4.6     5      3

Judges' Discrepancies
From Median Responses   9  24   1  10   (entries for Judges 5 to 9 are not legible in this copy)


                    Items/Objectives Matching Task

Reviewer:

Date:

Content Area:

First, read carefully through the lists of domain specifications and
test items. Your task is to indicate whether or not you feel each test
item is a measure of one of the domain specifications. It is, if you
feel examinee performance on the test item would provide an indication
of an examinee's level of performance in a pool of test items measuring
the domain specification. Beside each objective, write in the test item
numbers corresponding to the test items which you feel measure the objective.
In some instances, you may feel items do not measure any of the available
domain specifications. Write these test item numbers in the space provided
at the bottom of the rating form.


Objective          Matching Test Items

    1

    2

    3

No Matches


Figure 3.2.2    An example of a judge's summary sheet for the
items/objectives matching task.




                                Table 3.2.2

            Summary and Analysis of the Judges' Item/Objective
                               Matching Task

Objective     Test Item     Percent of Matches

    1             2                 78
                  7                 89
                 14                 89

    2             1                 78
                  3                 89
                  8                 33
                 13                 11

    3             4                 89
                  6                 44
                 12                100

    4             5                 89
                  9                 78
                 10                 11
                 11                 89

Percentage of Matches for Each Judge (Judges 1 to 9):
    64    43    100    64    71    64    79    86    57

[The individual judges' match entries for the 14 test items and the three
"lemon" items, and the row giving the number of "lemons" misidentified by
each judge, are not legible in this copy.]


Further, the "accuracy" of each content specialist can be checked
if a specified number of "lemon" items (items not measuring any of the
objectives) are introduced into the matching task. A content specialist's
effectiveness can be measured by the number of "lemon" items detected.
(Hambleton [1978] noted, however, that such a method of evaluation would
not detect a "poor" judge if he/she was very critical of many of the
test items.) Content specialists who fall short of some standard of
performance can have their ratings removed from the statistical analysis
of item ratings. One example of a standard might be: A content specialist must
identify correctly at least 75% of the "lemon" items.
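
Screening judges against such a standard is a simple proportion check. The sketch below uses invented data: the four "lemon" item numbers, the judge labels, and the function names are hypothetical. Judge D is included to illustrate Hambleton's caveat: a judge who rejects nearly every item detects all the "lemons" and passes the standard despite being a poor judge.

```python
# Hypothetical "lemon" items planted in the matching task.
LEMONS = {15, 16, 17, 18}

# Item numbers each judge wrote in the "No Matches" space (hypothetical).
NO_MATCH = {
    "Judge A": [15, 16, 17, 18],
    "Judge B": [8, 15, 16, 18],
    "Judge C": [15, 17],
    "Judge D": list(range(1, 19)),   # rejects every item, lemons included
}


def lemon_detection_rates(no_match_lists, lemons):
    """Proportion of the planted lemon items each judge identified."""
    return {
        judge: len(set(items) & lemons) / len(lemons)
        for judge, items in no_match_lists.items()
    }


def retained_judges(no_match_lists, lemons, standard=0.75):
    """Judges who identified at least `standard` of the lemon items."""
    rates = lemon_detection_rates(no_match_lists, lemons)
    return [judge for judge, rate in rates.items() if rate >= standard]


rates = lemon_detection_rates(NO_MATCH, LEMONS)
kept = retained_judges(NO_MATCH, LEMONS)
print(rates)
print(kept)   # ['Judge A', 'Judge B', 'Judge D']
```

Because Judge D passes, a practical screen would probably pair the lemon check with a second one, such as the judge's percentage of matches on the genuine items.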

D.    Field Test of the Three Procedures

Rovinelli  and  Hambleton  (1977)  conducted  an  empirical  study  of 
the  use  of  three  procedures  (test  items  matched  to  objectives  using 
a  three-point  rating  scale,  semantic  differential  scale,  and  a  matching 
task)  using  48  items  and  12  objectives  from  a  ninth  grade  science 
curriculum.    The  reader  is  asked  to  refer  to  the  article  for  details;  we 
will  summarize  their  findings  here.    They  found  that  all  three  methods  did 
provide  useful  information  for  ascertaining    if  an  item  is  a  measure  of  an 
objective.    They  also  found  some  differences  in  the  sorts  of  data  collected 
through  the  use  of  the  three  techniques.     For  instance,  the  data  appeared 
to  show  that  the  content  specialists,  when  using  the  rating  procedure 
(semantic  differential),  judged  the  items  to  be  relevant  measures  of  objec- 
tives other  than  the  intended  ones  more  often  than  when  using  the  matching 
procedure.     They  recommend: 



Given the task of judging which items are measures
of the intended objectives, the Hemphill-Westie
procedure is recommended over the other two techniques.
Two statements are offered in support of
this recommendation. One, the numeric representation
of the data, the index of item-objective
congruence, provides a meaningful interpretation
of the extent to which an item is judged to be a
valid measure of the intended objective. Two,
there are methods for determining the reliability
and validity of the data collected. Further, these
methods can be tested for statistical significance.

Rovinelli and Hambleton offer three cautionary notes about the use of the
Hemphill-Westie procedure. One, the procedure does not give information
on the quality of the items or the suitability of the distractors. Two,
the dimensionality of the data must be known in advance. Three, the
procedure is very time consuming for a large number of items and objectives.
In sum, when deciding upon which of the procedures to use with content
specialists' ratings, Rovinelli and Hambleton suggest:

. . . before selecting the type of judgmental
procedure to use, the test constructor should
take into consideration the information desired
and the resources available, and then choose
the most appropriate procedure.


E.     Item  Review  Form 

An  item  review  form  for  multiple-choice  test  items  in  the  Reading/ 
Language  Arts  area  was  introduced  in  Unit  2.    The  form  appears  again  in 
Figure  3.2.3.    A  summary  of  the  item  reviews  of  several  content  special- 
ists is  very  useful  in  the  content  validation  process. 




Recently, another item review form and instructions for its use
were developed and piloted in a multiple-choice item writing workshop
conducted in Baton Rouge, Louisiana. The material is presented in
Figure 3.2.4. The special feature of this new form is that the item
ratings for up to ten test items measuring an objective can be reported
on a single page.

Forms in Figures 3.2.3 and 3.2.4 can easily be prepared for
other item formats using the item writing principles offered in Unit 2.

F.    Summary

Data from all of the procedures sketched out in Section 3.2.1
are useful for determining the content validity of a set of test items.
The  data  derived  from  any  one  of  the  procedures  can  be  used  to  answer 
the  following  important  questions: 


1.  Which  items  failed  to  "match"  the  domain  specifications 
they  were  prepared  to  measure? 

2.  How  successful  were  the  test  item  writers? 

3.  How  can  the  content  validity  data  on  the  test  items  be 
used  to  rewrite  domain  specifications? 

4.  Who  were  the  "best"  content  specialists  in  the  rating 
process? 


3.2.2    Representativeness  of  the  Test  Items 

This  step  cannot  be  completed  until  the  test  items  to  be  included 
in  a  test  have  been  selected.     It  is  usually  desirable  to  have  test  items 
in  a  criterion-referenced  test  that  are  representative  of  the  domain  of 
items  indicated  in  a  domain  specification,  i.e.,  criterion-referenced  tests 




2/17/78 


Figure  3.2.3    A  sample  multiple-choice  item  review  form. 


Item  Review  Form 
(Multiple-Choice) 


Objective  Number: 


Test  Item  Number: 


Reviewer: 


Date: 


Objective: 


Test  Item: 


Section  I.    Technical  Quality 

Place a check mark under the column corresponding to your rating of the test item for
the questions in this section and the next one.

Yes    Questionable    No

1.  Is the item stem clearly written for the
intended group of examinees?

2.  Is the item stem free of irrelevant material?

3.  Is a problem clearly defined in the item
stem?

4.  Are the choices clearly written for
the intended group of examinees?

5.  Are the choices free of irrelevant
material?



6.  Is there a correct answer or a clearly
best answer?

7.  Have words like "always," "none," or
"all" been removed?

8.  Are likely examinee mistakes used to
prepare incorrect answers?

9.  Is "all of the above" avoided as a
choice?

10.  Are the choices arranged in a logical
sequence (if one exists)?

11.  Was the correct answer randomly positioned
among the available choices?

12.  Are all repetitious words or expressions
removed from the choices and included
in the item stem?

13.  Are all of the choices of approximately
the same length?

14.  Do the item stem and choices follow
standard rules of punctuation and grammar?

15.  Are all negatives underlined?

16.  Are grammatical cues between the item stem
and the choices which might give the
correct answer away removed?

17.  Is the item format appropriate for
measuring the intended objective?




Section  II.     Technical  Quality  Matters  Specific  to 
Reading/Language  Arts  Test  Items 


1.  Can  a  correct  answer  be  given  without 
reading  the  passage? 

2.  Is  the  discourse  appropriate  for 
measuring  the  intended  objective? 

3.  Do the discourse and test item provide
a valid measure of the intended objective?

4.  Do  the  following  fall  within  the  range 
for  the  number  of  words  in  each  sentence 
of  the 

(a)  directions? 

(b)  discourse? 

(c)  item  stem? 

(d)  item  choices? 

5.  Do  the  following  fall  within  the  range  for 
the  number  of  sentences  in  the 

(a)  directions? 

(b)  discourse? 

(c)  item  stem? 

(d)  item  choices? 

6.  Is  there  the  desired  number  of  words 
in  the  selection  of  discourse? 

7.  Does  the  test  item  contain  the  desired 
number  of  choices? 

8.  Is the ratio of common to uncommon
words correct?


Suggested Revisions:


Final Rating (Check One):


□ Accept    □ Accept (with revisions; see above)    □ Reject



Figure 3.2.4    Instructions for Using the Item Review Form


1.  Obtain a copy of a domain specification and the test items written to
measure it.

2.  Place the domain specification number, your name, and today's date in
the space provided at the top of the Item Review Form.

3.  Place the numbers corresponding to the test items you will evaluate
in the spaces provided near the top of the Item Review Form.  The
numbers should be in ascending order as you read from left to right.
(This must be done if processing of your data along with data from
many other reviewers is to be done quickly and with a minimum number
of errors.)

4.  Read  the  domain  specification  carefully. 

5.  Read the first test item carefully and answer the first 18 questions.
Mark "✓" for "Yes"; mark "X" for "No"; and mark "?" if you are "Unsure."

The last question requires you to provide an overall evaluation of the
test item as an indicator of the domain specification it was written
to measure.

There are five possible ratings:

5 - Excellent
4 - Very Good
3 - Good
2 - Fair
1 - Poor

6.  Write  any  comments  or  suggested  wording  changes  on  or  beside  the  test 
item. 

7.  Repeat  the  rating  task  for  each  of  the  available  test  items. 

8.  Staple your Item Review Form, domain specification, and copy of the
test items together.




Item  Review  Form 
Multiple  Choice 


Domain  Specification  No. 


Reviewer 


Date 


Test Item Characteristics (Mark "✓" for Yes, "X" for No, and "?" for Unsure)


1.  Is the item stem clearly written for the intended group of students?

2.  Is the item stem free of irrelevant material?

3.  Is a single problem clearly defined in the item stem?

4.  Are the answer choices clearly written for the intended group of students?

5.  Are the answer choices free of irrelevant material?

6.  Is there a correct answer or a clearly best answer?

7.  Have words like "always," "none," or "all" been removed?

8.  Are likely student mistakes used to prepare incorrect answers?

9.  Is "all of the above" avoided as an answer choice?

10.  Are the answer choices arranged in a logical sequence (if one exists)?

11.  Was the correct answer randomly positioned among the available answer choices?

12.  Are all repetitious words or expressions removed from the answer choices and
included in the item stem?

13.  Are all of the answer choices of approximately the same length?

14.  Do the item stem and answer choices follow standard rules of punctuation
and grammar?

15.  Are all negatives underlined?

16.  Are grammatical cues between the item stem and the answer choices, which
might give the correct answer away, removed?

17.  Are letters used in front of the possible answer choices to identify them?

18.  Have expressions like "which of the following is not" been avoided?

19.  Disregarding any technical flaws which may exist in the test item (addressed
by the first 18 questions), how well do you think the content of the test
item matches with some part of the content defined by the domain specification?
(Remember the possible ratings: 1-poor, 2-fair, 3-good, 4-very good, 5-excellent)




Test  Item  Numbers 




must be content valid.  Only in some highly special cases has it been
possible to completely specify a pool of relevant test items.  For example,
there have been some successes in the areas of mathematics and spelling.
But these examples are far removed from the content worlds of interpretative
poetry, creative writing, and finite projective geometry.  What then
is to be done?  Should we "close up shop" and fade back into the murky
interpretative world of norm-referenced testing?

For  starters,  test  developers  need  to  work  hard  to  define  and  to 
develop  their  domain  specifications.     If  content  issues  are  clarified 
fully enough, content specialists can comment on the apparent
representativeness of items included in a test.  An even better procedure is
Cronbach's  duplication  experiment.    The  experiment  requires  two  teams 
of  equally  competent  item  writers  and  reviewers  to  work  independently 
in  developing  a  criterion-referenced  test.    Cronbach's  (1971)  directions 
are: 

They  would  be  aided  by  the  same  definition  of  relevant 
content,  sampling  rules,  instructions  to  reviewers,  and 
specifications  for  tryout  and  interpretation  of  the 
data.  .  . 

If the domain specification is clear, and if sampling is
representative, the two tests should be equivalent.  We could check this by
administering  both  tests  to  the  same  group  of  examinees.    One  problem 
is  that  "a  common  blind  spot  is  almost  impossible  to  detect"  (Cronbach, 
1971). 




3.3    Collection  and  Analysis  of  Student 
Response  Data 

While test score variability is not a factor in criterion-referenced
test construction, neither is it a completely useless concept.  Indeed,
variability will be observed when a sample of examinees is heterogeneous
in terms of their domain scores.  When the composition of an examinee
sample is established a priori, the resulting variability provides additional,
helpful information for assessing test items.  Haladyna (1974) offered
a procedure for circumventing the problem of lack of variance in
criterion-referenced test scores, thus allowing the use of traditional
(norm-referenced) item discrimination indices.  If there are two samples of
students, one sample instructed on the objectives comprising the test and
another group uninstructed (or groups of "masters" and "non-masters" after
instruction), these samples can be combined, thus increasing the
variability in the scores to the extent that a traditional point-biserial
correlation coefficient (an index of item discrimination) can be utilized.
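
As a rough illustration of the pooling idea described above, the following Python sketch computes a point-biserial correlation for a single item after an instructed and an uninstructed sample have been combined. The function name `point_biserial`, the variable names, and all of the scores are hypothetical; they are not taken from Haladyna (1974). The formula assumed here uses the population standard deviation of the total scores.

```python
from math import sqrt


def point_biserial(item, total):
    """Point-biserial correlation between 0/1 item scores and total scores.

    Assumes the population (divide-by-n) standard deviation of the totals.
    """
    n = len(item)
    if n != len(total) or n < 2:
        raise ValueError("item and total must have the same length (at least 2)")
    n_pass = sum(item)
    n_fail = n - n_pass
    if n_pass == 0 or n_fail == 0:
        raise ValueError("item has no variance: everyone passed or everyone failed")
    mean_pass = sum(t for i, t in zip(item, total) if i == 1) / n_pass
    mean_fail = sum(t for i, t in zip(item, total) if i == 0) / n_fail
    mean_all = sum(total) / n
    sd = sqrt(sum((t - mean_all) ** 2 for t in total) / n)
    if sd == 0:
        raise ValueError("total scores have no variance")
    p = n_pass / n
    return (mean_pass - mean_fail) / sd * sqrt(p * (1 - p))


# Hypothetical data: one item score and one total score per examinee.
instructed_item = [1, 1, 1, 1, 0]
instructed_total = [9, 10, 8, 9, 7]
uninstructed_item = [0, 0, 1, 0, 0]
uninstructed_total = [3, 2, 5, 4, 3]

# Pool the two samples to increase score variability, then correlate.
combined_item = instructed_item + uninstructed_item
combined_total = instructed_total + uninstructed_total
r_combined = point_biserial(combined_item, combined_total)
print(round(r_combined, 3))
```

Because the two groups differ in their total scores, the pooled sample has the variability that the coefficient needs. The value depends on how the two groups are mixed, so it describes this particular pooled sample rather than the item in general.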

Item statistics, such as discrimination indices (Cox and Vargas,
1966; Crehan, 1974; Haladyna, 1974; Henrysson and Wedman, 1974), may
provide useful information for detecting "bad" items.  Indeed, Wedman
(1973) gives a compelling argument for using item statistics.  He argues
that even carefully prepared domain specifications and precise item
generation specifications never completely eliminate the subjective
judgments that, to greater and lesser degrees, influence the test construction
process.  In order to guard against this subjective element, albeit small,
domain specifications and item generating procedures should be
complemented with empirical evidence on the items.




Essentially,  empirical  procedures  involve  the  use  of  various  item 
statistics  that  measure  item  difficulty  and  item  discrimination.  In 
most  instances,  for  these  statistics  to  be  meaningful,  it  is  necessary 
to  have  some  item  variability  across  examinees. 

There has been discussion on the matter of item and test variance
with criterion-referenced tests (Haladyna, 1974; Millman and Popham,
1974; Woodson, 1974a, b).  Our own view, which is in agreement with
Millman and Popham (1974), is that item and test variance are unnecessary
with criterion-referenced tests.  It is important that a
criterion-referenced test have content validity, and scores derived from the test
must have construct validity (i.e., the test must measure what we say
the test measures).  Construct validity can be assessed in several ways
(this point will be discussed more fully in Unit 5).  For example, some
variability of estimated domain scores could be expected across a pool
of examinees consisting of "masters" and "non-masters," and to the extent
that there was no (or limited) variability, the construct validity of
the test scores (assuming content validity had been established) should
be questioned.  The test ought to reflect some variability of scores
across "masters" and "non-masters" groups (perhaps post- and
pre-instruction groups), although one would not select items to maximize the difference
between scores in the two groups since that would make it difficult to
obtain "valid" estimates of domain scores.

A point that must be stressed here is this:  Item statistics derived
from a field test should not serve as the sole criterion for refining
an item pool or for constructing a criterion-referenced test.  As
Millman (1974) noted, "Item statistics can, however, be used to detect
flawed items" (p. 339).


In  discussing  the  various  criterion-referenced  test  item  statistics, 
a  bit  of  background  information  may  be  helpful.    When  criterion-referenced 
testing  received  its  "birth"  in  the  late  sixties  and  early  seventies, 
the  favored  inclination  was  to  try  to  pattern  procedures  for  criterion- 
referenced  tests  after  those  of  the  already  well-developed  norm-referenced 
procedures.    For  norm-referenced  tests,  the  item  indices,  item  difficulty 
and  item  discrimination,  were  helpful  in  making  decisions  about  test  items. 
Naturally,  the  inclination  of  criterion-referenced  theorists  was  to  try 
to  apply  these  indices,  especially  item  discrimination  indices  (such  as 
the  point-biserial  correlation  coefficient)  to  criterion-referenced  test 
items.    The  problem  with  such  an  approach  is  that  these  indices  are  built 
upon  the  concept  of  correlation,  and  correlation  analysis  is  dependent 
upon  a  degree  of  variability  in  the  data.    This  is  not  likely  to  be  the 
case in criterion-referenced testing situations where most of the students
should achieve mastery of the behavior in question, and thus the
test scores will have little variability.  Quite simply, the norm-referenced
indices are consonant with the implicit use of norm-referenced
tests, to facilitate comparisons of students, and not with the use of
criterion-referenced tests, to indicate how much a student knows.

Various writers in the criterion-referenced testing field have
developed indices for detecting aberrant test items, and these indices
do not suffer from the problem of norm-referenced indices applied in
criterion-referenced testing situations.  There is, however, little
agreement about which index is optimal for a
given situation.  Because of this fact, and because of the following
three points, we feel that these indices should be used with a great
deal of caution.  In particular, we feel the indices should be used to
detect aberrant items that need to be reworked, and not to make decisions
about which items should and should not be on the test.  The three points
are:

1.  The methods are based on the performance of a specific group
of examinees, and this greatly limits the generalizability
of the results.

2.  It  is  difficult  to  determine  the  impact  of  instruction  on 
these  item  statistics. 

3.  Many  of  the  procedures  require  pre-test  and  post-test  data  on 
the  same  set  of  test  items.    This  data  is  not  likely  to  be 
collected  by  practitioners  in  classroom  settings. 

Next, several of the more promising item statistics will be
considered.  For an excellent review of these and other statistics, the
interested  reader  is  referred  to  a  paper  by  Berk  (1978). 

3.3.1    Standard  Item  Indices 

There are a number of standard statistical indices which appear to
provide useful information for determining whether the items are adequate
measures of the instructional objectives they were written to measure.
When items in a domain are expected to be relatively homogeneous (this
would be the case if the domain is defined narrowly), it has become a
fairly common practice for the test developer to compare estimates of
item difficulty parameters, or item discrimination parameters, or both.
A typical item analysis printout for three items is shown in Figure
3.3.1.  Since one would expect items measuring an objective equally well
to have similar item parameters, estimates of the parameters are compared
to detect items that deviate from the norm defined by the remaining
items.  Such "deviant" items are carefully scrutinized.  In particular,
content specialists' judgments of the "deviant" items are considered.  If
the items look acceptable, they are returned to the item domain.  (Of




Figure 3.3.1    A computer print-out of a standard item analysis


ITEM      PER CENT CORRECT     PER CENT CORRECT     RBISERIAL WITH
NUMBER    FOR STUDENTS WHO     FOR ALL STUDENTS     TOTAL SCORE
          ATTEMPTED ITEM

15        0.5758               0.5758               0.3999
20        0.6364               0.6364               0.3954
39        0.3708               0.3788               0.4254

For each item, the print-out also reports the NUMBER OF STUDENTS ANSWERING
EACH ALTERNATIVE BY QUARTER: rows for quarters 1 through 4 (16, 16, 17,
and 17 students) and a TOTAL row (66 students), with columns for NOT
REACHED, OMITTED, alternatives 1 through 5, and TOTAL.




course, it may also be the case that items sharing similar statistical
properties still do not measure the intended objectives and the
so-called "deviant" items do!  Hence, there is a need to check the empirical
results with the content specialists' ratings.)  A more formal method of
comparing item difficulty parameters is considered next.
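
The informal screening step described above, in which each item's estimate is compared with the norm defined by the remaining items, might be sketched in Python as follows. This is only an illustration under stated assumptions: the function `flag_deviant_items`, the use of the median of the remaining items as the "norm," the tolerance of 0.20, and the difficulty values are all hypothetical choices, not part of any procedure cited here. Flagged items are candidates for review by content specialists, not for automatic removal.

```python
from statistics import median


def flag_deviant_items(p_values, tolerance=0.20):
    """Return indices of items whose difficulty (proportion correct) differs
    from the median difficulty of the remaining items by more than tolerance.

    Both the median rule and the tolerance are illustrative assumptions.
    """
    p_values = list(p_values)
    if len(p_values) < 2:
        raise ValueError("need at least two items measuring the same objective")
    flagged = []
    for idx, p in enumerate(p_values):
        others = p_values[:idx] + p_values[idx + 1:]
        if abs(p - median(others)) > tolerance:
            flagged.append(idx)
    return flagged


# Hypothetical proportions correct for four items written for one objective.
difficulties = [0.58, 0.64, 0.38, 0.61]
deviant = flag_deviant_items(difficulties)
print(deviant)
```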

Brennan and Stolurow (1971) present a set of rules for identifying
criterion-referenced test items which are in need of revision.  The
decision process which they established for deciding which items to
revise can be used to help assess item validity.  However, our particular
interest is with their procedure for comparing difficulty levels of
items intended to measure the same objective.  Brennan and Stolurow (1971)
state that the item scores from criterion-referenced tests will most
likely not be normally distributed.  Therefore, in order to determine if
the item difficulties are equal, they propose the use of Cochran's Q test.
This statistic can be used to determine whether two or more item
difficulties differ significantly among themselves.  Cochran's Q is a test of
the hypothesis of equal correlated proportions.  For a large enough sample
of examinees, Q is approximately distributed as a chi-square variable with K-1
degrees of freedom, where K is the number of test items.  Rejecting the
null hypothesis, however, provides no guidance as to which items are
significantly different.  If the null hypothesis is rejected, pair-wise
comparisons need to be computed to locate deviant items.
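
The computation of Cochran's Q can be sketched in Python as follows. The sketch uses the standard textbook formula for Q; the function name `cochran_q` and the small data set are hypothetical and serve only to make the arithmetic easy to check by hand.

```python
def cochran_q(scores):
    """Cochran's Q for a table of 0/1 item scores.

    scores: list of rows, one row per examinee, one 0/1 entry per item.
    Returns (Q, degrees_of_freedom), where degrees_of_freedom = K - 1.
    """
    n_items = len(scores[0])
    if n_items < 2 or any(len(row) != n_items for row in scores):
        raise ValueError("every examinee needs scores on the same K >= 2 items")
    item_totals = [sum(row[j] for row in scores) for j in range(n_items)]
    examinee_totals = [sum(row) for row in scores]
    grand_total = sum(examinee_totals)
    denominator = n_items * grand_total - sum(t * t for t in examinee_totals)
    if denominator == 0:
        # Happens when every examinee passes all items or fails all items.
        raise ValueError("Q is undefined: no examinee has a mixed response pattern")
    numerator = (n_items - 1) * (
        n_items * sum(c * c for c in item_totals) - grand_total ** 2
    )
    return numerator / denominator, n_items - 1


# Hypothetical scores: six examinees (rows) on three items (columns).
responses = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
]
q_value, df = cochran_q(responses)
print(q_value, df)
```

For these hypothetical data Q = 6.5 with 2 degrees of freedom, which exceeds the tabled chi-square value of 5.99 at the .05 level, so pair-wise comparisons would be the next step. With only six examinees the chi-square approximation should not be trusted; the sample is kept small here only so that the calculation can be verified by hand.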

Perhaps we should note here that since the major purpose of
criterion-referenced tests is to provide information for describing
individual levels of mastery, one should compare item difficulties of
items intended to measure the same objectives and which have been
administered to the same group of examinees either before or after
instruction.  While it would be possible to compare items administered
to different groups receiving the same instruction, the assessment
problem would become more complex.  This complexity arises from the
need to determine whether the group compositions were the same and
whether the instruction was equally effective in each group.  We note
though that comparing item difficulties only makes sense as part of an
item validation process when the domain of items spanning an objective
is considered to be homogeneous.  There are many times when this
assumption will be untenable (Millman, 1974).

One would also expect the intercorrelations of items intended to
measure the same objective to be equal for a group of items homogeneous
with respect to that objective (Brennan and Stolurow, 1971).  If the
test developer is willing to assume that the departure from normality
for scores on the items is not a crucial problem, then there is a
technique available to test for the equality of pairs of product moment
correlation coefficients.  When this assumption is not tenable, test
developers will have to make subjective judgments as to the equality of
these inter-item correlation coefficients.

3.3.2    Item  Change  Statistic 

The difference between the difficulty level of an item before and
after instruction describes another item statistic that seems to have
some usefulness in the validation of criterion-referenced test items.
However, an important point to note is that a large difference between
the pretest and posttest item difficulty is not necessary, since items
may be valid indicators of the desired objectives but, because of poor
instruction, there may be very little change in difficulty level between
the two test administrations.  On the other hand, if instruction is
effective, one would expect to see a substantial change in item
difficulty if the item is a measure of the intended objective.  With several
items intended to measure the same objective, one could also compare
the item change indices for the purpose of detecting items that seem
to be operating differently from the others.

Cox and Vargas (1966) were the first to suggest the use of an
index linked to the instructional process.  Their posttest-pretest
difficulty index is obtained by computing the percentage of examinees
who pass an item on the posttest minus the percentage who pass the
item on the pretest.  Cox and Vargas ranked items on the basis of this
index and correlated these rankings with those obtained through the use
of a traditional norm-referenced test item index (the percentage of
students in the highest 27% in total posttest scores who pass the item
minus the percentage of the lowest 27% who pass the item).  The
correlation between the two item statistics was sufficiently low to allow
Cox in a later paper (1970) to note:

The pretest and posttest method of item
analysis produced results sufficiently different
from traditional methods to warrant its
consideration in those cases where score
variability is not the concern, such as in
criterion-referenced measures.
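
A minimal Python sketch of the posttest-pretest difficulty index follows. It works with proportions rather than percentages, and the function names and response data are hypothetical; only the definition of the index (posttest pass rate minus pretest pass rate) comes from the description of Cox and Vargas (1966) given above.

```python
def proportion_passing(responses):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    if not responses:
        raise ValueError("no responses supplied")
    return sum(responses) / len(responses)


def pre_post_index(pretest, posttest):
    """Posttest-pretest difficulty index for one item, as a proportion."""
    return proportion_passing(posttest) - proportion_passing(pretest)


# Hypothetical 0/1 responses of five examinees to one item.
pretest_responses = [0, 0, 1, 0, 0]
posttest_responses = [1, 1, 1, 0, 1]
change = pre_post_index(pretest_responses, posttest_responses)
print(round(change, 2))
```

Computing the index for every item written for an objective gives a list of values that can be compared to see whether one item is operating differently from the others.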

Popham (1971) proposed a priori and a posteriori approaches for
developing valid criterion-referenced test items.  The a priori approach
corresponds to the determination of validity by operationally
obtaining items from an item generation rule.  The a posteriori approach
consists of empirically determining whether or not items are defective.
In his discussion of the a posteriori approach, Popham presented a new
means for empirically evaluating criterion-referenced test items.  This
procedure represents an extension of the item change statistic and consists
of constructing the following four-fold table from the results of
a pre-posttest administration of a set of items measuring an objective:

                              Posttest
                       Incorrect    Correct

Pretest   Incorrect        A           B

          Correct          C           D

A, B, C, and D represent the number of examinees obtaining each of the
four possible response patterns for an item.  One then computes the
median value across items measuring the same objective for each of the
four cells.  (The median value is not as likely to be affected by
aberrant items as the mean would be.)  These values are used as expected
values and a chi-square statistic is computed (with three degrees of
freedom) for each item.

An alternate way of looking at this procedure is to consider the
median values for the four cells across the items measuring a particular
objective as constituting a "prototypic" item.  Then we can contrast
the actual four-fold frequencies for each item to the frequencies in the
cells for this prototypic item; large values of the "contrasting"
statistic, the chi-square, would indicate an atypical item.
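
The contrast with a "prototypic" item can be sketched in Python as follows. The sketch assumes the usual chi-square form, the sum over the four cells of (observed - expected)^2 / expected, with the cell medians serving as expected values; the function name `prototype_chi_squares` and the cell counts are hypothetical. The sketch only ranks items by how atypical they look and does not apply any cutoff value.

```python
from statistics import median


def prototype_chi_squares(tables):
    """Contrast each item's four-fold table with the median ("prototypic") table.

    tables: list of (A, B, C, D) counts, one tuple per item, where
        A = incorrect on pretest, incorrect on posttest
        B = incorrect on pretest, correct on posttest
        C = correct on pretest, incorrect on posttest
        D = correct on pretest, correct on posttest
    Returns (expected, chi_squares).
    """
    expected = [median(t[cell] for t in tables) for cell in range(4)]
    if any(e <= 0 for e in expected):
        raise ValueError("a median cell count of zero leaves chi-square undefined")
    chi_squares = [
        sum((obs - exp) ** 2 / exp for obs, exp in zip(t, expected))
        for t in tables
    ]
    return expected, chi_squares


# Hypothetical counts for four items measuring one objective (50 examinees each).
item_tables = [
    (10, 30, 2, 8),
    (12, 28, 3, 7),
    (11, 29, 2, 8),
    (30, 10, 5, 5),
]
expected, chi_squares = prototype_chi_squares(item_tables)
most_atypical = chi_squares.index(max(chi_squares))
print(expected, most_atypical)
```

In these hypothetical data the fourth item stands out because most examinees who missed it on the pretest also missed it on the posttest. Note that the median cell values need not sum to the number of examinees, so the statistic is best read as a descriptive index of atypicality.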

Three comments can be made concerning this index.  One, the index
is a measure of homogeneity.  Popham states that this procedure was
more accurate than visual scanning in locating the atypical items.
While Popham (1971) describes other descriptive statistics, the chi-square
analysis for detecting "bad" items seems to be the most promising one
that he offered.  Comments two and three are limitations that need to
be mentioned.  Two, Popham (1971) indicated that there is no established
critical value for chi-square above which the items can be regarded
as aberrant.  Three, Wedman (1973) has pointed out that the method
could lead to the elimination of items the test developer thinks are
aberrant, but it does not take into consideration the direction of the
abnormality.  If the bulk of the items were, at best, mediocre in terms
of representing the domain, a good item could be eliminated.

3.3.3    Items  as  Measures  of  Single  Objectives 

The concern here is with determining whether or not items are
measuring one objective.  Davis and Diamond (1974) have defined the
importance of this condition:

Unless all the items in a test measure exactly
the same variable or variables for which true
scores are highly correlated (say, .90 or greater),
it is inappropriate to use the test for diagnostic
purposes; that is, to determine an examinee's
level of performance on a single "pure variable."
This is because of the fact that two different
examinees may obtain identical scores by marking
correctly the same number of different items. . .

The implication for the preparation of homogeneous
items for a multi-item diagnostic test
is that each item must measure only one "pure"
variable plus error or the same weighted combination
of two or more "pure" variables, plus error.
In either of these cases, the item scores would
be found to measure, at a pre-selected level of
significance, the same dimension except for errors
of measurement and for differences of origin and
of units of measurement. . .

In reference to testing whether a criterion-referenced test item
measures more than one objective, factor analysis offers great untapped
potential.



3.4    Additional  Editing  of  the  Test  Items 

One problem for further research involves the development of a
method for the integrative use of content specialists' ratings and
empirical methods.  While such an integration of approaches could be
accomplished through logical analysis, perhaps a better way to proceed
would be to actually employ the different techniques in a variety of
situations and, through this practical experience, evolve a model for
the combined use of content specialists' ratings and empirical methods.
The work by Cronbach (1971) may help to provide a conceptual framework
for this integration since his treatment of test validity is the most
comprehensive to date.  Much work remains to be done in this area.

Suffice it to say, the test developer's task is to use the available
empirical data from content specialists and examinees to determine
whether, in his/her best judgment, the available data support the
hypothesis that the items are "valid" measures of the intended objectives.
When the data suggest otherwise, every effort should be made
to revise aberrant items.




3.5  References 

3.5.1    References  Cited 

Berk,  R.  A.    A  consumer's  guide  to  criterion-referenced  test  item  statistics. 
Measurement  in  Education,  1978,  9.,  1-12. 

Brennan,  R.  L.,  &  Stolurow,  L.  M.  An  empirical  decision  process  for 
formative  evaluation.  Research  Memorandum  No.  4.  Harvard  CAI 
Laboratory,  Cambridge,  Mass.,  1971. 

Cox,  R.  C.    Evaluative  aspects  of  criterion-referenced  measurement.  Paper 
presented  at  the  annual  meeting  of  AERA,  Minneapolis,  1970.  (ERIC, 
Ed  038  679). 

Cox,  R.  C,  &  Vargas,  J.  S.    A  comparison  of  item  selection  techniques 
for  norm-referenced  and  criterion-referenced  tests.    Paper  pre- 
sented at  the  annual  meeting  of  the  National  Council  on  Measurement 
in  Education,  Chicago,  1966. 

Davis,  F.  B. ,  &  Diamond,  J.  J.  The  preparation  of  criterion-referenced 
tests.  CSE  monograph  series  in  evaluation.  Los  Angeles:  Center 
for  the  Study  of  Evaluation,  University  of  California,  1974. 

Haladyna,  T.  M.    Effects  of  different  samples  on  item  and  test  character- 
istics of  criterion-referenced  tests.    Journal  of  Educational 
Measurement,  1974,  11,  93-99. 

Hambleton,  R.  K.    Validation  of  criterion-referenced  test  score  inter- 
pretations and  standard  setting  methods.    A  paper  presented  at  the  First 
Annual  Johns  Hopkins  University  National  Symposium  on  Educational  Re- 
search, Washington,  D.C.,  1978. 

Hemphill,  J.,  &  Westie,  C.  M.    The  measurement  of  group  dimensions. 
Journal  of  Psychology,  1950,  29,  325-342. 

Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S. Domain-
referenced curriculum evaluation: A technical handbook and a
case study from the Minnemast Project. CSE monograph series in
evaluation, No. 1. Los Angeles: Center for the Study of Evaluation,
University of California, 1973.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley,
California: McCutchan Publishing Co., 1974.

Millman, J., & Popham, W. J. The issue of item and test variance for
criterion-referenced tests: A clarification. Journal of
Educational Measurement, 1974, 11, 137-138.

Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. The measurement of
meaning. Urbana, IL: University of Illinois Press, 1957.

Popham, W. J. Indices of adequacy for criterion-referenced test items.
In W. J. Popham (Ed.), Criterion-referenced measurement: An
introduction. Englewood Cliffs, N.J.: Educational Technology
Publications, 1971.




Rovinelli, R. J., & Hambleton, R. K. On the use of content specialists
in the assessment of criterion-referenced test item validity.
Dutch Journal of Educational Research, 1977, 2, 49-60.

Wedman, I. On the evaluation of criterion-referenced tests. Paper
presented at the International Symposium on Educational Testing, The
Hague, the Netherlands, 1973.

Woodson, M.I.C.E. The issue of item and test variance for criterion-
referenced tests. Journal of Educational Measurement, 1974, 11,
63-64. (a)

Woodson, M.I.C.E. The issue of item and test variance for criterion-
referenced tests: A reply. Journal of Educational Measurement,
1974, 11, 139-140. (b)


3.5.2    Additional  References 

Berk, R. A. Criterion-referenced test item analysis and validation. A
paper presented at the First Annual Johns Hopkins University
National Symposium on Educational Research, Washington, D.C., 1978.

Crehan, K. D. Item analysis for teacher-made mastery tests. Journal of
Educational Measurement, 1974, 11, 255-262.

Henrysson, S., & Wedman, I. Some problems in construction and evaluation
of criterion-referenced tests. Scandinavian Journal of Educational
Research, 1974, 18, 1-12.

Herbig, M. Item analysis by use in pre-tests and post-tests: A
comparison of different coefficients. Programmed Learning and
Educational Technology, 1976, 13, 49-54.




Unit  4 

Test  Assembly  and  Administration 


Prepared  By 


Ronald  K.  Hambleton 
University  of  Massachusetts,  Amherst 

and 

Daniel R. Eignor
Educational  Testing  Service 


March  15,  1979 




Table of Contents

4.0  Overview of the Unit
4.1  Introduction
4.2  Determination of Test Length
     4.2.1  Introduction
     4.2.2  The Basic Situation
     4.2.3  Millman's Use of the Binomial Model
     4.2.4  Novick and Lewis' Bayesian Approach: Introduction
            A.  Examples of the Effects of Different Priors
            B.  Losses Associated with Incorrect Decisions
            C.  Test Length Specifications
            D.  Suggestions
     4.2.5  Fhaner's and Wilcox's Use of Indifference Zones
            A.  Summary of the Procedure
            B.  Comments
     4.2.6  Eignor-Hambleton Approach for Determining Test Length
            A.  Introduction
            B.  Research Design
            C.  Results and Discussion
            D.  Suggestions for Further Research and Development
     4.2.7  A Method of Selecting a Procedure for
            Determining Test Length
4.3  Test Item Selection
     4.3.1  Post Item Selection Checklist
4.4  Preparation of Directions
4.5  Layout and Test Booklet Preparation
4.6  Preparation of Scoring Keys
4.7  Preparation of Answer Sheets
4.8  Test Administration
4.9  References Cited




4.0    Overview of the Unit

This unit covers steps 7 and 9 of the Criterion-Referenced Test
Development and Validation Model presented in Unit 1. These steps are:

7.  Test Assembly

    a.  Determination of Test Length
    b.  Test Item Selection
    c.  Preparation of Directions
    d.  Layout and Test Booklet Preparation
    e.  Preparation of Scoring Keys
    f.  Preparation of Answer Sheets

9.  Test Administration

Four procedures are offered in Section 4.2 for determining test
length. The remainder of the material in the unit (covering steps
7b, ..., 7f, and 9) is straightforward. Our discussion of these
steps for criterion-referenced test development is very similar to
the discussion one would find of these steps for preparing
norm-referenced tests.


Note:     It  is  likely  that  some  of  the  material  in  Section  4.2  will 
be  more  meaningful  if  Units  5  and  6  are  studied  first. 




4.1  Introduction 

In Unit 4, we will discuss research and procedures directed
toward the assembly and administration of a criterion-referenced test.
In many of the sections we will offer checklists that can aid in the
process. The material presented here will vary greatly in difficulty
and in length of presentation. For instance, a great deal can be said
about how to determine the number of test items per objective, while
little can be said about the preparation of test directions. In the
sections that duplicate established principles for norm-referenced
tests, we have presented a synthesis of the research pertaining to the
section and have directed readers to the source or sources from which
it came.



4.2    Determination  of  Test  Length 
4.2.1  Introduction 

The  length  of  a  criterion-referenced  test  (or  more  importantly, 
the  number  of  test  items  measuring  each  objective  in  a  test)  is  directly 
related  to  the  usefulness  of  the  criterion-referenced  test  scores  ob- 
tained from  the  test.    Short  tests,  typically,  produce  imprecise  domain 
score  estimates,  and  lead  to  mastery  decisions  which  prove  to  be  incon- 
sistent across  parallel-form  administrations  (or  retest  administrations). 
Therefore, criterion-referenced test scores obtained from short tests
have limited value. When estimation of domain scores is of concern, the
relationships among domain scores, errors of measurement, and test
length, as summarized in the item-sampling model, are well known (Lord
and Novick, 1968) and provide a basis for determining test length.
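
As a rough illustration of how test length drives the precision of a
domain score estimate, the sketch below assumes the simple binomial
error model. That model is an assumption made here for illustration
(the item-sampling model in Lord and Novick is more general), and the
function name is hypothetical.

```python
import math

def domain_score_standard_error(domain_score: float, n_items: int) -> float:
    """Standard error of the proportion-correct score under a binomial model."""
    return math.sqrt(domain_score * (1.0 - domain_score) / n_items)

# Precision of the estimate for an examinee whose domain score is .80
for n_items in (5, 10, 20, 40):
    se = domain_score_standard_error(0.80, n_items)
    print(f"{n_items:2d} items: standard error = {se:.3f}")
```

Under this model the standard error shrinks with the square root of
the number of items, so quadrupling the test length only halves the
error.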

When using criterion-referenced tests to assign examinees to
mastery states, the problem of determining test length can be related
to the number of classification errors one is willing to tolerate. One
way to assure low probabilities of misclassification is to make the
test very long. However, this is not usually feasible. Currently, there
exist at least two ways to reduce classification errors without
lengthening a test. One involves utilizing Bayesian estimation
procedures incorporating prior and collateral information. The second
involves the implementation of an adaptive testing scheme especially
designed for hierarchically-structured objectives (see Hambleton &
Eignor, 1978; Spineti & Hambleton, 1977).

The material and procedures to follow can be separated roughly into
four sections. The first section involves the work of Millman (1973),
utilizing the binomial model. The second section involves the work of
Novick and Lewis (1974), using Bayesian estimation procedures. The
third section involves the specification of an "indifference zone"; the
work of Fhaner (1974) and Wilcox (1976) will be considered here. The
final section includes the work of Eignor and Hambleton (1979) relating
test length and cut-off scores to several reliability and validity
indices.

4.2.2    The  Basic  Situation  Revisited 


Regardless of which solution one adopts to the test length problem,
the basic situation remains the same. It is as follows: There is a
domain or population of test items (it may be real or may have to be
hypothesized). These items deal with a particular objective and are of
varying, unspecified difficulty. We want to pass the student on the
objective if he/she can answer a given percentage of items in the domain.
The actual percentage of items the student could pass on the whole domain
or population of items can be called his/her domain score. Practical
constraints, such as time, force the test practitioner to estimate this
domain score by taking a random or stratified random sample of items
from the domain and testing on those items. Now, because the test score
is based on a sample from the domain, it is not likely to coincide with
the domain score. There will be error, and from the point of view
adopted for criterion-referenced tests, we view the error as error in
the decision process. That is, the extent to which an individual's test
score is discrepant from his/her domain score can be viewed as a problem
involving the probability of classifying that individual improperly,
i.e., as a false positive (a non-master who is assessed as a master on
the test) or a false negative (a master who is assessed as a non-master
on the test). Logic dictates that the longer the test is, the smaller
the chance of making classification errors. Practicality dictates
against having long tests, due to time problems, construction problems,
etc. Thus, the concern becomes one of determining the minimal test
length that keeps classification errors at an acceptable level.


4.2.3    Millman's Use of the Binomial Model

Millman (1973) considered the error properties of mastery
classification decisions made by comparing a domain score estimate to an
advancement score. By introducing the binomial test model, it is simple
to determine the probability of misclassification, conditional upon an
examinee's domain score, an advancement score, a cut-off score, and the
number of items in the test. (An advancement score is distinguished from
a cut-off score in Millman's work in the following way: The advancement
score is the minimum number of items that an examinee must answer
correctly to be assigned to a mastery state. The cut-off score is the
point on the domain score scale used to separate examinees into true
mastery and true non-mastery states.) By varying test length and the
advancement score, an investigator can determine the test length and
advancement score that produce a desired probability of
misclassification for a given domain score.
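
The binomial calculation just described can be sketched in a few lines
of Python. This is only an illustration under the binomial test model;
the function and argument names are hypothetical and are not taken from
Millman (1973).

```python
from math import comb

def misclassification_probability(domain_score, cut_off, advancement_score, n_items):
    """Probability that an examinee is misclassified under the binomial model.

    A true non-master (domain_score < cut_off) is misclassified when the test
    score reaches the advancement score; a true master is misclassified when
    the test score falls short of it.
    """
    prob_advance = sum(
        comb(n_items, k) * domain_score**k * (1 - domain_score) ** (n_items - k)
        for k in range(advancement_score, n_items + 1)
    )
    if domain_score < cut_off:
        return prob_advance        # false positive
    return 1.0 - prob_advance      # false negative

# 8 items, advancement score of 6, cut-off score of .75
print(round(100 * misclassification_probability(0.70, 0.75, 6, 8)))  # 55
print(round(100 * misclassification_probability(0.80, 0.75, 6, 8)))  # 20
```

Varying `n_items` and `advancement_score` in such a function is one way
to explore the combinations an investigator might consider.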

By making the following assumptions, Millman was able to obtain a
solution to the test length problem:

1.  The test is a random sample of dichotomously scored (0-1) items
    from the domain,

2.  The likelihood of correct response is a fixed quantity across
    all test items for an individual,

3.  Responses to questions on the test are independent, and

4.  Errors fit the binomial test model.

No assumptions involving item content or difficulty are necessary,
nor are any group-based indices used. Millman (1973) compared the
situation to the usual urn example used for explaining the binomial
distribution:

Rather  the  items  which  an  examinee  can  pass  and 
those  the  individual  fails  are  analogous  to  two 
colors  of  balls  in  an  urn.    Continuing  the  analogy, 




the  test  length  question,  How  many  balls 
must  be  sampled  (items  administered)  from  the 
urn  so  that  the  percent  of  all  balls  of  a 
given  color  (test  items  in  the  domain  answered 
correctly)  can  be  estimated  accurately?  The 
errors  associated  with  other  examinees  are  of 
no  concern. 

Table 4.2.1 can be used to obtain the probability that a student
with a particular domain score will be incorrectly advanced or retained
by the procedures. (It is assumed that some meaningful method has been
utilized to arrive at the cut-off score.) The following comments can
be made concerning the use of Table 4.2.1:

1.  To the left of the dotted line indicating the cut-off
    score is the expected percent of students who will be
    advanced incorrectly. Likewise, to the right of the
    dotted line is the expected percent of students who
    will be incorrectly retained even though their domain
    scores are at or above the criterion level. In other
    words, to the left of the line are the false-positive
    error rates and to the right are the false-negative
    error rates.

2.  A larger proportion of the students whose domain scores
    are close to, or at, the cut-off score will be
    incorrectly classified than those at a greater distance
    from the cut-off score. Sometimes this proportion is
    greater than half. For instance, for a cut-off score = .75
    on a test with 8 items that has an advancement score of 6,
    a student whose domain score is .70 will be incorrectly
    advanced (passed) 55% of the time.

3.  Millman looks at the probability that a student will
    attain a particular test score, given his/her domain
    score. However, an examinee's domain score is an unknown.
    Estimating it is, of course, the purpose of testing in
    the first place! On the other hand, it is usually not too
    difficult, in most situations, to make an educated guess.


Example 1:

For a cut-off score of .80, suppose a practitioner is willing to
accept a 25% misclassification error for those students whose domain
scores are 70% and 90%. How large should the random sample of items be,




Table 4.2.1  Percent of Students Expected to be
             Incorrectly Advanced or Retained

Cut-off Score = .70

Advance- No. of
ment     Test               Student's Domain Score*
Score    Items   50   55   60   65  |   70   75   80   85   90   95

    6        7    6   10   16   23  |   67   55   42   28   15    4
    6        8   15   22   32   43  |   45   32   20   11    4    1
    7        9    9   15   23   34  |   54   40   26   14    5    1
    7       10   17   27   38   51  |   35   22   12    5    1
    8       11   11   19   30   43  |   43   29   16    7    2
    9       12    7   13   23   35  |   51   35   20    9    3    -
   10       13    5    9   17   28  |   58   42   25   12    3    -
   11       14    3    6   12   22  |   64   48   30   15    4
   12       15    2    4    9   17  |   70   54   35   18    6

Cut-off Score = .75

Advance- No. of
ment     Test               Student's Domain Score*
Score    Items   50   55   60   65   70  |   75   80   85   90   95

    6        8   15   22   32   43   55  |   32   20   11    4    1
    7        9    9   15   23   34   46  |   40   26   14    5    1
    8       10    6   10   17   26   38  |   47   32   18    7    1
    9       11    3    7   12   20   31  |   55   38   22    9    2
    9       12    7   13   23   35   49  |   35   20    9    3
   16       20    1    2    5   12   24  |   58   37   17    4
   17       21    -    1    4    9   20  |   63   41   20    5
   18       22    -    1    3    7   17  |   68   46   23    6    -




Table 4.2.1 (continued)

Cut-off Score = .80

Advance- No. of
ment     Test               Student's Domain Score*
Score    Items   50   55   60   65   70   75  |   80   85   90   95

    6        7    6   10   16   23   33   45  |   42   28   15    4
    7        8    4    7   11   17   26   37  |   50   34   19    6
    8        9    2    4    7   12   20   30  |   56   40   23    7
    8       10    6   10   17   26   38   53  |   32   18    7    1
    9       11    3    7   12   20   31   46  |   38   22    9    2
   10       12    2    4    8   15   25   39  |   44   26   11    2
   11       13    1    3    6   11   20   33  |   50   31   13    2
   12       15    2    4    9   17   30   46  |   35   18    6
   17       20                                |   59   35   13    2
   19       22                           16  |   67   42   17    2

Cut-off Score = .85

Advance- No. of
ment     Test               Student's Domain Score*
Score    Items   50   55   60   65   70   75   80  |   85   90   95

    7        8    4    7   11   17   26   37   50  |   34   19    6
    8        9    2    4    7   12   20   30   44  |   40   23    7
    9       10    1    2    5    9   15   24   38  |   46   26    9
   10       11    1    1    3    6   11   20   32  |   51   30   10
   11       12         1    2    4    9   16   28  |   56   34   12
   17       19              1    2    5   11   24  |   56   29    7
   19       21                   1    3    8   18  |   63   35    8

*A domain score is the proportion of items a student would be able
to answer correctly if he/she were given the entire pool of items measuring
an objective.

(Reproduced from Novick and Lewis, 1974, with permission from the
authors. Decimal points have been omitted.)




and what should the advancement score be?

Answer: From Table 4.2.1 it can be seen that for a test of 8 items
with a passing (advancement) score of 7, 26% of the students at the 70%
level and 19% at the 90% level will be misclassified.

Example 2:

For a cutting score of .85, suppose a practitioner is willing to
accept a 10% misclassification rate for students whose domain score is
.95. How many questions should the random sample (test) have, and what
should the advancement score be?

Answer: 11 items with an advancement score of 10.
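
The table lookup in Example 1 can also be approximated by a direct
search over test lengths and advancement scores. The sketch below is an
illustration only: it uses exact binomial probabilities rather than the
rounded table entries, and the function names are hypothetical.

```python
from math import comb

def prob_score_at_least(k, n, p):
    """P(test score >= k) for n items and domain score p (binomial model)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def shortest_test(low_score, high_score, tolerance, max_items=50):
    """Smallest (n_items, advancement_score) keeping both error rates within tolerance.

    low_score  : domain score of a non-master (error = being advanced)
    high_score : domain score of a master (error = being retained)
    """
    for n in range(1, max_items + 1):
        for adv in range(1, n + 1):
            false_positive = prob_score_at_least(adv, n, low_score)
            false_negative = 1.0 - prob_score_at_least(adv, n, high_score)
            if false_positive <= tolerance and false_negative <= tolerance:
                return n, adv
    return None

print(shortest_test(0.70, 0.90, 0.26))  # (8, 7)
print(shortest_test(0.70, 0.90, 0.25))  # (9, 8)
```

With a tolerance of 26% the search agrees with the answer to Example 1.
Held strictly to 25%, the exact calculation points to one more item,
because the error rate at the 70% level for 7 of 8 is slightly above
25%.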

The primary problem in applying the tables prepared by Millman
(1973) is that one would need to have a good prior estimate of an
examinee's domain score. Other problems have been suggested by Novick
and Lewis (1974). They reported that for certain combinations of
cut-off scores and test length, changing one or both to decrease the
probability of misclassification for those above the cut-off score will
actually increase the probability of misclassification for those below
the cut-off score. In order to choose the appropriate combination of
test length and advancement score, one must have some idea of whether
the preponderance of examinees are above or below the cut-off score and
one must have some idea of the relative costs of misclassification.
However, the first requirement can only be satisfied with prior
information about the domain scores of the group of examinees. Novick
and Lewis (1974) suggested that it would be useful to have some
systematic way of incorporating prior knowledge into the test length
determination problem.



Table 4.2.2 below, from Novick and Lewis (1974), highlights the
problem they raise:

Table 4.2.2  Percent of Students Expected To Be Incorrectly
             Advanced or Retained

Criterion Level = .75    Test Length = 8

Advancement                Domain Score Level
Score          50   55   60   65   70  |   75   80   85   90   95

    6          15   22   32   43   55  |   32   20   11    4    1
    7           4    7   11   17   26  |   63   50   34   19    6


Suppose 7 out of 8 were taken as the minimum advancement score.
Then for students whose true levels are <.75, the probabilities of
misclassification fall off dramatically, while for students whose true
levels are >.75 (more than likely where most of the students are
located), these probabilities remain quite high. This would be the area
where one would want the probability to be lower. Novick and Lewis
conclude that a "framework would need to take into account on which
side of .75 small expected errors were considered to be more important."
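
The trade-off between the two advancement scores can be reproduced with
a short calculation. The sketch below is illustrative and uses exact
binomial probabilities, so an occasional cell may differ by one point
from the printed table; the names `prob_advanced` and `error_row` are
hypothetical.

```python
from math import comb

CUT_OFF = 0.75
N_ITEMS = 8
DOMAIN_SCORES = [s / 100 for s in range(50, 100, 5)]  # .50, .55, ..., .95

def prob_advanced(domain_score, advancement_score, n_items):
    """P(test score >= advancement score) under the binomial model."""
    return sum(
        comb(n_items, k) * domain_score**k * (1 - domain_score) ** (n_items - k)
        for k in range(advancement_score, n_items + 1)
    )

def error_row(advancement_score):
    """Percent misclassified at each domain score level, rounded."""
    row = []
    for pi in DOMAIN_SCORES:
        p_adv = prob_advanced(pi, advancement_score, N_ITEMS)
        error = p_adv if pi < CUT_OFF else 1.0 - p_adv
        row.append(round(100 * error))
    return row

for adv in (6, 7):
    print(adv, error_row(adv))
```

Raising the advancement score from 6 to 7 lowers every entry below the
criterion level and raises every entry at or above it, which is the
pattern the table illustrates.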

4.2.4    Novick and Lewis' Bayesian Approach: Introduction*

Instead of considering the probability that a student will attain
a test score, given his/her true level (an unknown), it would be better
to consider the probability that a student's domain score
exceeds a given cutting score, given his/her test score. A student will

*An excellent introduction to Bayesian methods is given by Novick,
M. R., & Jackson, P. H. Statistical methods for educational and
psychological research. New York: McGraw-Hill, 1974. Chapter 5 is
especially relevant for our work in this unit.

then be passed on to the next unit only if there is a sufficiently high
probability that his/her domain score exceeds the cutting score, given
his/her test score. The procedures offered by Novick and Lewis allow
such a probability to be assessed. According to Novick and Lewis:

To obtain the necessary probability an application of
Bayes' theorem is required. In such an analysis prior
knowledge (expressed in probabilistic terms) of the
student's true level of functioning is combined with
the (binomial) model information relating the observed
test score to true level: the result is a posterior
probability for true level of functioning, given the
test score. The probability this distribution assigns
to levels above the criterion is the quantity of
interest.

Novick and Lewis produced a table (Table 4.2.3) reporting values of
Prob(π ≥ π₀ | x, n), i.e., the probability of an examinee having a domain
score greater than or equal to a cut-off score π₀ given a proportion-
correct score of x/n, for typical values of π₀, x, and n used in
objectives-based instructional programs. (Actually, the test lengths
considered in their paper are a little longer than those often used in
practice. The shortness of many criterion-referenced tests in use today
is due in part to the failure of users to have any idea about the number
of classification errors that are made with criterion-referenced tests.)
In Table 4.2.3, π₀ takes on values ranging from .50 to .95 (in
increments of .05), test scores vary from 6 to 11, and test lengths vary
from 8 to 12. Their table can be used to select both an advancement
score and test length to ensure that Prob(π ≥ π₀) is larger than some
desired value (say 70%). For example, if an instructor desired to ensure
that Prob(π ≥ .80) was greater than .70, Table 4.2.3 shows that an
examinee should answer 8 of 8 test items correctly.
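
The posterior probability described above can be approximated
numerically. The sketch below assumes a beta prior combined with
binomial test data, so that the posterior is again a beta distribution;
it integrates the posterior density with a simple midpoint rule. The
function names and the numerical method are illustrative choices made
here, not the procedure used by Novick and Lewis to build their tables.

```python
import math

def beta_tail_probability(pi0, a, b, steps=20000):
    """Approximate Prob(pi >= pi0) for a Beta(a, b) distribution.

    Midpoint rule on [pi0, 1]; the midpoints never touch 0 or 1, so the
    logarithms below are always defined.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = (1.0 - pi0) / steps
    total = 0.0
    for i in range(steps):
        p = pi0 + (i + 0.5) * width
        total += math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))
    return total * width

def posterior_mastery_probability(x, n, pi0, prior_a=1.0, prior_b=1.0):
    """Prob(pi >= pi0 | x correct out of n) with a Beta(prior_a, prior_b) prior."""
    return beta_tail_probability(pi0, prior_a + x, prior_b + n - x)

# Uniform prior, i.e. Beta(1, 1)
print(f"{posterior_mastery_probability(8, 8, 0.80):.2f}")  # 0.87
print(f"{posterior_mastery_probability(7, 8, 0.80):.2f}")  # 0.56
```

Passing other values for `prior_a` and `prior_b` gives the
corresponding probabilities for an informative beta prior.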




Table 4.2.3  Probability Student's Domain Score Is Greater
             Than π₀ Given a Uniform Prior Distribution

Minimum
Advance- No. of
ment     Test     Posterior                    Cut-off Score π₀
Score    Items    Distribution   50   55   60   65   70   75   80   85   90   95

    6       8     β(7,3)         91   85   77   66   54   40   26         5    1
    7       8     β(8,2)         98   96   93   88   80   70   56   40   23    7
    8       8     β(9,1)        100  100   99   98   96   92   87   77   61   37
    7       9     β(8,3)         95   90   83   74   62   47        18    7    1
    8       9     β(9,2)         99   98   95   91   85   76   62   46   26    9
    9       9     β(10,1)       100  100   99   99   97   94   89   80   65   40
    7      10     β(8,4)         89   81   70   57   43   29   16    7    2
    8      10     β(9,3)         97   93   88   80   69   54   38   22    9    2
    9      10     β(10,2)        99   99   97   94   89   80   68   51   30   10
    8      11     β(9,4)         93   87   77   65   51   35   21    9    3
    9      11     β(10,3)        98   96   92   85   75   61   44   26   11    2
   10      11     β(11,2)       100   99   98   96   92   84   73   56   34   12
    9      12     β(10,4)        95   91   83   72   58   42   25   12    3
   10      12     β(11,3)        99   97   94   89   80   67   50   31   13    2
   11      12     β(12,2)       100  100   99   97   94   87   77   60   38   14

(Tables 4.2.3 thru 4.2.11 are reproduced from Novick and Lewis, 1974, with
permission from the authors.)




A.    Examples of the Effects of Different Priors

In this section, we will present tables from Novick and Lewis that
demonstrate the effects of specifying different priors. Consider two
situations. We will label them "A" and "B".

In situation A, we know very little about a student's domain score
prior to test administration. Hence, we select as our prior a uniform
distribution (β[1,1]) on the interval from zero to unity. Table 4.2.3
provides the posterior probabilities for various test lengths and
cut-off scores.

To use Table 4.2.3 to select test length, one must decide on the
cut-off score and the minimum acceptable probability that a student's
domain score exceeds this cut-off score.

Example 3:

If we take the cut-off score (π₀) to be .80 and the minimum
probability to be .5 (i.e., Prob(π ≥ π₀ | x, n) = .5, where x = test
score and n = test length), what is the minimum number of test items
that can be used, and what is the minimum advancement score?

Answer: 8 items with an advancement score of 7, because
Prob(π ≥ π₀ | 7, 8) = .56, which is greater than .50.

Example 4:

Suppose the cut-off score is π₀ = .90 and we want Prob(π ≥ .9 | x, n) =
.5. What is the minimum number of test items that can be used, and what
is the minimum advancement score?

Answer: There is no number of test items and advancement score less
than perfection that satisfies this condition for the test
specifications in the table. Note that Prob(π ≥ .9 | 8, 8) = .61 and
Prob(π ≥ .9 | 9, 9) = .65. The answer is 8 items, and an advancement
score of 100% (8 out of 8 items answered correctly).
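
For the uniform prior, the table lookups in Examples 3 and 4 can be
checked with an exact calculation. The sketch below relies on a standard
identity: with a uniform prior the posterior is β(x + 1, n - x + 1),
and for whole-number parameters its upper tail equals a binomial sum.
The function names and the `allow_perfect` switch are hypothetical
conveniences introduced for this illustration.

```python
from math import comb

def posterior_prob_uniform(x, n, pi0):
    """Prob(pi >= pi0 | x correct out of n) under a uniform prior.

    Equals P(Bin(n + 1, pi0) <= x), the upper tail of Beta(x + 1, n - x + 1).
    """
    m = n + 1
    return sum(comb(m, k) * pi0**k * (1 - pi0) ** (m - k) for k in range(x + 1))

def shortest_test(pi0, min_prob, max_items=12, allow_perfect=False):
    """Smallest (n_items, advancement_score) meeting min_prob, or None."""
    for n in range(1, max_items + 1):
        top = n if allow_perfect else n - 1
        for x in range(top + 1):
            if posterior_prob_uniform(x, n, pi0) >= min_prob:
                return n, x
    return None

print(shortest_test(0.80, 0.5))  # (8, 7)
print(shortest_test(0.90, 0.5))  # None
```

The first call matches Example 3. The second returns nothing for tests
of up to 12 items when a perfect score is not required, in line with
the answer to Example 4.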

In situation B, suppose that our prior probability that a student is
functioning above the cutting score of .80 is .75. (When we specified
the uniform prior, we set the prior probability at .20.) Novick and
Lewis note that this belief can be characterized by the beta prior
distribution β(10.254, 1.746). Table 4.2.4 gives the same sort of
information as Table 4.2.3, but is based upon this revised prior.

To use Table 4.2.4, one must again set a cut-off score and decide on
a minimum acceptable probability that a student's domain score exceeds
this cutting score.

Example 5:

Suppose the cut-off score is π₀ = .90 and we want Prob(π ≥ .9 | x, n) = .5.
What is the minimum number of test items that can be used, and what is the
minimum advancement score? [Assume the prior is given by β(10.254, 1.746).]

Answer: For 12 items with an advancement score of 11, we have
Prob(π ≥ .9 | 11, 12) = .48, which is sufficiently close to .50. (Shorter
test lengths can be chosen, 8 and 9 test items, if the advancement score
is set at 100%.)




Table 4.2.4  Probability Student's Domain Score Is
             Greater Than π₀ Given a β(10.254, 1.746) Prior
             Distribution

Minimum
Advance- No. of
ment     Test     Posterior                          Criterion Level π₀
Score    Items    Distribution         50   55   60   65   70   75   80   85   90   95

    6       8     β(16.254, 3.746)    100  100   98   96   90   78   60   37   15    2
    7       8     β(17.254, 2.746)    100  100  100   99   97   92   81   62   36   10
    8       8     β(18.254, 1.746)    100  100  100  100   99   98   94   85   66   32
    7       9     β(17.254, 3.746)    100  100   99   97   92   82   65   41   17    2
    8       9     β(18.254, 2.746)    100  100  100   99   98   93   84   66   39   11
    9       9     β(19.254, 1.746)    100  100  100  100  100   98   95   87   69   34
    7      10     β(17.254, 4.746)    100   99   97   93   84   68   47   24    7    1
    8      10     β(18.254, 3.746)    100  100   99   98   93   84   68   45   19    3
    9      10     β(19.254, 2.746)    100  100  100   99   98   95   86   69   42   12
    8      11     β(18.254, 4.746)    100   99   98   94   87   72   51   27    8    1
    9      11     β(19.254, 3.746)    100  100  100   98   95   87   72   48   22    3
   10      11     β(20.254, 2.746)    100  100  100  100   99   96   88   72   45   13
    9      12     β(19.254, 4.746)    100  100   99   96   89   76   55   30   10    1
   10      12     β(20.254, 3.746)    100  100  100   99   96   89   75   52   24    4
   11      12     β(21.254, 2.746)    100  100  100  100   99   96   90   75   48   14

Note: The mean and mode, respectively, of β(10.254, 1.746) are .855 and .925, and
for this distribution Prob(π ≥ π₀) for π₀ = .70, .75, .80, .85 are .92,
.86, .75, and .59, respectively. A close look at these distributional
characteristics will help a decision maker determine if this prior
distribution is a realistic characterization of his/her beliefs.

(Taken from Novick and Lewis, 1974.)
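
The mean and mode quoted in the note follow from the standard formulas
for a beta distribution, a/(a + b) and (a - 1)/(a + b - 2). The short
sketch below is only a convenience for checking a candidate prior; the
function names are hypothetical.

```python
def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_mode(a: float, b: float) -> float:
    """Mode of a Beta(a, b) distribution (valid when a > 1 and b > 1)."""
    return (a - 1.0) / (a + b - 2.0)

print(f"mean = {beta_mean(10.254, 1.746):.4f}")  # mean = 0.8545
print(f"mode = {beta_mode(10.254, 1.746):.4f}")  # mode = 0.9254
```

A decision maker could compare these summaries for several candidate
priors before settling on one that reflects his/her beliefs.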


We can assess the effects of prior information by looking at some
representative situations and the probabilities associated with them.


Situation                   Uniform Prior    β(10.254, 1.746) Prior

Prob(π ≥ .8 | 6, 8)              .26                  .60
Prob(π ≥ .8 | 10, 12)            .50                  .75
Prob(π ≥ .9 | 6, 8)              .05                  .15
Prob(π ≥ .9 | 10, 12)            .13                  .24
Prob(π ≥ .7 | 7, 9)              .62                  .92
Prob(π ≥ .7 | 9, 11)             .75                  .95

These situations are provided as examples, but the message is clear:
specifying a prior of the sort in situation B results in a much higher
probability statement about an individual's domain score exceeding the
cutting score, given the test data. According to Novick and Lewis:

When the decision maker specifies an informative prior
distribution he is saying, in effect, that he wants a
decision which will have a high probability of being
correct in that portion of the decision space in which
he thinks the student's ability truly lies.




B.    Losses Associated with Incorrect Decisions

Before discussing Novick and Lewis's tables for test length, we
must discuss how they specify loss ratios, since this is critical for use of
the tables.  Their specification is consonant with the formulations for the
decision-theoretic approach to setting cut-off scores (see Unit 6), but we
will discuss these procedures here for continuity.  The following two-fold
table of losses associated with decisions can be constructed:


                            Domain Score
Decision             π ≥ π0            π < π0

advance                 0                 a
retain                  b                 0

where

π0 = cutting score on domain of tasks

a = loss associated with advancing a student whose true
    level π < π0 (false positive error)

b = loss associated with retaining a student whose true
    level π ≥ π0 (false negative error)


Suppose a = b.  According to Novick and Lewis:

If it were no more serious to advance a student whose
level was below the criterion than to retain a student
who was above, we would be behaving optimally if we
were to advance students with posterior probabilities
above .5 and retain the others.




Novick and Lewis further point out that if the loss for false
advancement is twice that of false retention, which is a more reasonable
situation, then only those students whose posterior probabilities are greater
than 2/3 = .67 should be advanced.

More generally, the decision rule to be used is to advance a student if
his/her test score is such that

\[ b \cdot \operatorname{Prob}(\pi \ge \pi_0 \mid x, n) \;\ge\; a \cdot \operatorname{Prob}(\pi < \pi_0 \mid x, n) \]

and retain him/her if this is not true.  An equivalent procedure is to compare the
loss ratio a/b (it would be 2/1 above) to the ratio

\[ \frac{\operatorname{Prob}(\pi \ge \pi_0 \mid x, n)}{\operatorname{Prob}(\pi < \pi_0 \mid x, n)} \]

and to advance the student only if this ratio is at least as large as the loss ratio.

Various loss ratios are specified in the tables to be discussed next.
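
The decision rule can be restated as a single threshold on the posterior probability. The short derivation below is a sketch that assumes zero loss for correct decisions and writes P for Prob(π ≥ π0 | x, n), so that Prob(π < π0 | x, n) = 1 - P.

```latex
\begin{align*}
b\,P \ge a\,(1 - P)
  &\iff (a + b)\,P \ge a \\
  &\iff P \ge \frac{a}{a + b} = \frac{a/b}{a/b + 1}.
\end{align*}
% Loss ratio a/b = 1.0:  advance if P >= 1/2     = .50
% Loss ratio a/b = 1.5:  advance if P >= 1.5/2.5 = .60
% Loss ratio a/b = 2.0:  advance if P >= 2/3     = .67
% Loss ratio a/b = 2.5:  advance if P >= 2.5/3.5 = .71
% Loss ratio a/b = 3.0:  advance if P >= 3/4     = .75
```

These thresholds are the values printed in parentheses beside each loss ratio in the column headings of Tables 4.2.6 through 4.2.10.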

C.    Test Length Specifications

In order to use the tables that follow, one must specify the following:

1.  A criterion level, or cutting score, π0, must be chosen.

2.  Prior knowledge of the student's domain score must be
    translated into a prior probability distribution of the β form
    for π (use the methods described in Novick and Jackson, 1974).

3.  A loss ratio a/b for the relative losses associated with the two
    types of incorrect decisions must be chosen.

From these specifications, the tables give:

1.  recommended test lengths,

2.  minimum advancement scores,

3.  posterior probability that the domain score is
    greater than π0, given the test data, and

4.  the percentage correct specified by the advancement rule for the
    recommended sample size(s).


Table 4.2.5 provides some beta distributions and corresponding
parameters; it will be helpful to individuals setting priors.

Before providing some examples, we should comment on Tables 4.2.8 and 4.2.9, which
appear to be the same.  On careful scrutiny, one will notice that the
expected values of the prior distributions are different, and this changes
the entries in the body of the table.  In Table 4.2.8, the expected value of the
prior is .80, which equals π0, while in Table 4.2.9 it is .85, which is larger
than π0.  The sample sizes that are recommended in Table 4.2.9 are clearly more
attractive.  For instance, at a loss ratio of 3.0, for a beta prior β(12, 3) with
expected value .80, the test would be 22 items with an advancement score of 19, while
when the expected value is .85, the recommended test drops to only 13 items
with a passing score of 11.  Novick and Lewis comment as follows:

When loss ratios are high it may well be advantageous
to strengthen the training program to the extent that
the mean output is well above the specified criterion
level.  This will make it possible to use short tests,
or, alternatively, will generally reduce the risk of
incorrect classification.


Example 6:

You have decided on a cutting score π0 = .8 and your prior has been
computed to be β(8, 2).  You have decided on a loss ratio of 2.5 (it is 2.5
times as costly to advance someone whose π < π0 than to retain someone whose
π > π0).  What are the recommended test length and advancement score, and
the associated probability?

Answer:  20 questions with an advancement score of 17 (i.e., 85% mastery
based on the test).  Prob(π > .8 | 17, 20) = .72.  We are 72% sure an individual's
domain score is above .80.
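
The advancement score in Example 6 can be checked directly from the decision rule. The sketch below is illustrative only and is not the procedure Novick and Lewis used to build their tables: it assumes the usual beta-binomial update (a β(a, b) prior and x correct out of n items give a β(a + x, b + n - x) posterior), fixes the test length at the tabled n = 20, and uses hypothetical function names with Simpson's rule for the tail probability.

```python
import math

def prob_above(pi0, a, b, steps=4000):
    """Prob(pi > pi0) under beta(a, b) via Simpson's rule.

    Assumes a > 1, b > 1, 0 < pi0 < 1, and an even number of steps.
    """
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(p):
        return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

    h = (1.0 - pi0) / steps
    total = pdf(pi0)  # the density at the right endpoint (p = 1) is 0
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(pi0 + i * h)
    return total * h / 3.0

def posterior(prior_a, prior_b, x, n):
    """Beta posterior parameters after x correct answers on n items."""
    return prior_a + x, prior_b + (n - x)

def min_advancement_score(n, prior_a, prior_b, pi0, loss_ratio):
    """Smallest score x whose posterior Prob(pi > pi0) meets the loss-ratio threshold."""
    threshold = loss_ratio / (loss_ratio + 1.0)
    for x in range(n + 1):
        a, b = posterior(prior_a, prior_b, x, n)
        p = prob_above(pi0, a, b)
        if p >= threshold:
            return x, p
    return None  # no score on this test length satisfies the rule

# Example 6: pi0 = .80, prior beta(8, 2), loss ratio 2.5, 20 items.
print(min_advancement_score(20, 8, 2, 0.80, 2.5))
```

For these inputs the search stops at a score of 17 with a posterior probability of about .72, in agreement with the answer to Example 6.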



Table 4.2.5    Selected Prior Distributions for Advancement Decisions

                           Effective                      Prob(π1 ≤ π < π2)*
       Prior               Prior
No.    Distribution        Sample Size   Mean   .00-.70  .70-.75  .75-.80  .80-.85  .85-.90  .90-1.00

 1     β(5.6, 2.4)              8        .70      .46      .12      .12      .12      .10      .08
 2     β(6, 2)                  8        .75      .33      .12      .13      .14      .13      .15
 3     β(6.4, 1.6)              8        .80      .21      .10      .12      .15      .16      .26
 4     β(6.8, 1.2)              8        .85      .12      .07      .09      .13      .17      .42
 5     β(7.2, .8)               8        .90      .05      .04      .06      .09      .14      .62
 6     β(7, 3)                 10        .70      .46      .14      .14      .12      .09      .05
 7     β(7.5, 2.5)             10        .75      .32      .13      .15      .15      .13      .12
 8     β(8, 2)                 10        .80      .20      .10      .14      .16      .17      .23
 9     β(8.5, 1.5)             10        .85      .10      .07      .10      .14      .19      .40
10     β(9, 1)                 10        .90      .04      .03      .06      .10      .16      .61
11     β(8.4, 3.6)             12        .70      .47      .15      .15      .12      .08      .03
12     β(9, 3)                 12        .75      .32      .14      .16      .16      .13      .09
13     β(9.6, 2.4)             12        .80      .18      .11      .15      .18      .18      .20
14     β(10.2, 1.8)            12        .85      .09      .07      .11      .16      .20      .37
15     β(10.8, 1.2)            12        .90      .03      .03      .06      .11      .17      .60
16     β(10.5, 4.5)            15        .70      .47      .17      .16      .12      .06      .02
17     β(11.25, 3.75)          15        .75      .30      .16      .18      .17      .13      .06
18     β(12, 3)                15        .80      .16      .12      .17      .20      .19      .16
19     β(12.75, 2.25)          15        .85      .07      .07      .12      .18      .23      .33
20     β(13.5, 1.5)            15        .90      .02      .03      .06      .11      .19      .59

*Note:  All entries have been rounded to two decimal places and smoothed so that the
row totals add to 1.00.



Table 4.2.6    Recommended Sample Sizes and Advancement Scores

π0 = .70

                                             Loss Ratio
Prior
Distribution        E(π)     1.5(.60)      2.0(.67)      2.5(.71)      3.0(.75)

β(5.6, 2.4)¹        (.70)    6/8(.62)      10/13(.70)    11/14(.74)    12/15(.78)
β(7, 3)             (.70)    6/8(.61)      10/13(.69)    11/14(.73)    12/15(.77)
β(8.4, 3.6)         (.70)    6/8(.61)      10/13(.68)    11/14(.72)    12/15(.76)
β(10.5, 4.5)        (.70)    9/12(.62)²    10/13(.67)    11/14(.71)    12/15(.75)

General Recommendations      6/8(75%)      10/13(77%)    11/14(79%)    12/15(80%)

¹A priori, Prob(π ≥ .70) for each of the four prior distributions is .54,
 .54, .53, and .53.
²For 6/8, Prob(π ≥ .70) = .598.




Table 4.2.7    Recommended Sample Sizes and Advancement Scores

π0 = .75

                                             Loss Ratio
Prior
Distribution        E(π)     1.5(.60)      2.0(.67)      2.5(.71)      3.0(.75)

β(6, 2)¹            (.75)    8/10(.65)     16/20(.70)    17/21(.74)    18/22(.77)
β(7.5, 2.5)         (.75)    8/10(.64)     16/20(.69)    17/21(.73)    18/22(.76)
β(9, 3)             (.75)    8/10(.63)     16/20(.69)    17/21(.72)    18/22(.75)
β(11.25, 3.75)      (.75)    8/10(.62)     16/20(.68)    17/21(.71)    19/23(.77)²

General Recommendations      8/10(80%)     16/20(80%)    17/21(81%)    18/22(82%)

¹A priori, Prob(π ≥ .75) = .56, .55, .55, and .54, respectively, for the four
 prior distributions used in this table.
²For 18/22, Prob(π ≥ .75) = .744.


Table 4.2.8    Recommended Sample Sizes and Advancement Scores

π0 = .80

                                             Loss Ratio
Prior
Distribution        E(π)     1.5(.60)      2.0(.67)      2.5(.71)      3.0(.75)

β(6.4, 1.6)¹        (.80)    6/7(.66)      7/8(.70)      17/20(.72)    19/22(.78)
β(8, 2)             (.80)    6/7(.65)      7/8(.69)      17/20(.72)    19/22(.77)
β(9.6, 2.4)         (.80)    6/7(.64)      7/8(.68)      17/20(.71)    19/22(.76)
β(12, 3)            (.80)    6/7(.63)      7/8(.67)      18/21(.73)²   19/22(.75)

General Recommendations      6/7(86%)      7/8(88%)      17/20(85%)    19/22(86%)

¹A priori, Prob(π ≥ .80) = .57; for 8/10, Prob(π ≥ .80) = .55; for 16/20, Prob(π ≥ .80)
 = .54; for 8.5/10, Prob(π ≥ .80) = .67; for 8.3/10, Prob(π ≥ .80) = .62; for 9/10,
 Prob(π ≥ .80) = .78.
²For 17/20, Prob(π ≥ .80) = .70.




Table 4.2.9    Recommended Sample Sizes and Advancement Scores

π0 = .80

                                             Loss Ratio
Prior
Distribution        E(π)     1.5(.60)      2.0(.67)      2.5(.71)      3.0(.75)

β(6.8, 1.2)⁵        (.85)    8/10(.64)     9/11(.69)     10/12(.72)¹   11/13(.76)
β(8.5, 1.5)         (.85)    8/10(.66)     9/11(.70)     10/12(.73)²   11/13(.76)
β(10.2, 1.8)        (.85)    8/10(.67)     9/11(.71)     9/11(.71)³    11/13(.77)
β(12.75, 2.25)      (.85)    8/10(.69)     9/11(.72)     9/11(.72)⁴    11/13(.78)

General Recommendations      8/10(80%)     9/11(82%)     10/12(83%)    11/13(85%)

¹For 5/6, Prob(π ≥ .80) = .72.
²For 5/6, Prob(π ≥ .80) = .73.
³For 10/12, Prob(π ≥ .80) = .74.
⁴For 10/12, Prob(π ≥ .80) = .75.
⁵For the four prior distributions, the a priori probabilities of π ≥ .80 are .72,
 .73, .74, and .75.  With these prior distributions and with 7/10, the
 posterior probabilities of π ≥ .80 are .41, .43, .46, and .48.




Table 4.2.10    Recommended Sample Sizes and Advancement Scores

π0 = .85

                                             Loss Ratio
Prior
Distribution        E(π)     1.5(.60)      2.0(.67)      2.5(.71)      3.0(.75)

β(6.8, 1.2)¹        (.85)    7/8(.62)      9/10(.70)     17/19(.73)    18/20(.76)³
β(8.5, 1.5)         (.85)    7/8(.62)      9/10(.69)     17/19(.72)    19/21(.77)
β(10.2, 1.8)        (.85)    7/8(.61)      9/10(.68)     17/19(.72)    19/21(.76)
β(12.75, 2.25)      (.85)    7/8(.60)      9/10(.67)     17/19(.71)²   19/21(.75)

General Recommendations      7/8(87.5%)    9/10(90%)     17/19(89%)    19/21(90%)

¹The a priori probabilities for π ≥ .85 are .59, .58, .58, and .57.
²For 10/11, Prob(π ≥ .85) = .695.
³For 19/21, Prob(π ≥ .85) = .78.




Example 7:

You have decided on a cutting score π0 = .85 and your prior is
β(12.75, 2.25).  Your loss ratio is 1.5.  What are the recommended test
length, advancement score, and associated probability?

Answer:  8 questions with an advancement score of 7 (i.e., 87.5%).
Prob(π > .85 | 7, 8) = .60; we are 60% sure an individual's domain score
is above .85.


D.    Suggestions

We suggest that the tables developed by Novick and Lewis be used any
time you can specify a meaningful prior.  As mentioned before, the tables
developed by Millman are meaningful only for quick estimates.  They give
probabilities of test data, given domain score; what is really needed is
the probability of domain score, given test data.  We recommend that if there
is no suitable prior, Table 4.2.1 in this section be consulted.  Also,
for such a situation, the methods developed by Fhaner and Wilcox, involving
the specification of an indifference zone, should be considered.

4.2.5    Fhaner (1974) and Wilcox (1976) Use of Indifference Zones

What follows is a discussion of the use of indifference zones
merging the work of Fhaner and Wilcox, using Wilcox's notation.  The
basic situation is that described in section 1, and is similar to the
section on Millman's procedures.



The binomial distribution can be used to estimate the probability of an
examinee whose domain score is π obtaining a test score of x items out
of n items:

\[ \operatorname{Prob}(x \mid \pi) = \binom{n}{x}\, \pi^{x} (1 - \pi)^{n - x} \]

where x/n is an unbiased estimate of π.
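
As a small illustration of the binomial model, the probability of each possible test score can be computed directly. This is a minimal sketch; the function name is hypothetical and the numbers (a 9-item test, domain score .9) are chosen only for illustration.

```python
from math import comb

def binomial_prob(x, n, pi):
    """Prob(x | pi): probability of exactly x correct on n items for domain score pi."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Probability that an examinee with domain score .9 answers at least 8 of 9 items correctly.
n, pi = 9, 0.9
at_least_8 = binomial_prob(8, n, pi) + binomial_prob(9, n, pi)

# The expected proportion correct equals the domain score, which is the sense in
# which x/n is an unbiased estimate of pi.
expected_proportion = sum(x * binomial_prob(x, n, pi) for x in range(n + 1)) / n
print(at_least_8, expected_proportion)
```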

Tests are used in a context; the context for criterion-referenced
testing is decision making, where the test score will be used to classify
individuals.  To separate individuals into mastery states
(mastery versus non-mastery), a cutting score π0 is established such that if
π < π0 the examinee is a non-master; if π ≥ π0 the examinee is a master.  The
tester has only the test score x to work with, not π, and needs to decide
if π < π0 or π ≥ π0.  Hence, there is the risk of false positive errors (π < π0,
but the examinee passes based on the test) or false negative errors (π ≥ π0,
but the examinee fails based on the test).  Let α be the probability of a
Type I (false positive) error and β be the probability of a Type II (false
negative) error.  A performance score n0 needs to be established such that:

\[ \operatorname{Prob}(x \ge n_0 \mid \pi) \le \alpha \quad \text{for all } \pi < \pi_0 \]

\[ \operatorname{Prob}(x < n_0 \mid \pi) \le \beta \quad \text{for all } \pi \ge \pi_0 \]

Since α = 1 - β (as π approaches π0 the two error probabilities sum to one),
it is not possible to keep both probabilities at
acceptably low levels.  An explicit solution to the problem is generated
by establishing an indifference zone.  Let c be a positive constant, and
form the open interval (π0 - c, π0 + c).  For individuals whose
domain score is close to π0 (within the interval from π0 - c to π0 + c), we
are "indifferent" as to how they are classified, i.e., there is negligible
loss in misclassification of such individuals.  For individuals whose domain
score is greater than π0 + c or less than π0 - c, we want to be reasonably
certain the correct decision is made.  Schematically,


                           Domain Score Scale

0.0 --------------- π0 - c -------- π0 -------- π0 + c --------------- 1.0
                    |<----------- Indifference Zone ----------->|


Thus far we have been working with the domain of tasks.  We must
now specify procedures involving the test itself.  Let n0 = passing score or
advancement score on the test.  Thus, if x ≥ n0, the student is advanced;
if x < n0, the student is retained.  A correct decision is made for the
student if x < n0 and π < π0, or x ≥ n0 and π ≥ π0.  Let P* be a number such that
1/2 < P* < 1.  Our goal is to establish n as small as possible (for a certain n0) so that
for values of π not in the indifference zone, the probability of a correct
decision is at least P*.  (From this point on, following Wilcox's notation,
α and β denote probabilities of correct decisions rather than error probabilities.)

For values of π ≤ π0 - c, the minimum probability of a correct decision
occurs at the point π0 - c and is given by:

\[ \alpha = \sum_{x=0}^{n_0 - 1} \binom{n}{x} (\pi_0 - c)^{x} (1 - \pi_0 + c)^{n - x} \]

For values of π ≥ π0 + c, the minimum probability of a correct decision
occurs at the point π0 + c and is given by:

\[ \beta = \sum_{x=n_0}^{n} \binom{n}{x} (\pi_0 + c)^{x} (1 - \pi_0 - c)^{n - x} \]




Now, to choose n, Wilcox specifies:

In particular, we choose the smallest integer n so
that α and β are greater than or equal to P* which
implies that the probability of a correct decision
is at least P* for π ≥ π0 + c and π ≤ π0 - c.

Wilcox provides tables for various combinations of the variables involved
in the formula.  In order to use these tables, the following must be specified:

1.  π0:  The cutting score for the domain of items.  Wilcox
    specifies the π0's to be .70, .75, .80, .85.

2.  c:  The positive constant that forms the indifference zone.
    Wilcox uses c = .05 and c = .10.  Thus, for π0 = .75 and c = .10,
    we are indifferent as to our classification for scores in
    the interval (.65, .85).

3.  P*:  The minimum probability of a correct decision for scores
    not in the indifference region.  Wilcox uses P* = .75.

By specifying these values, Wilcox's table then gives you n and n0, along
with the probability of correctly classifying an examinee with a
domain score ≥ π0 + c or ≤ π0 - c.

Example 8:

Suppose π0, the cut-off score, = .80, c = .1, and P* = .75.
What is the least number of questions that can be used to have greater than
a 75% chance of correctly classifying an examinee whose domain score is greater
than π0 + c (= .9) or less than π0 - c (= .7)?

Answer:  For π0 = .8, c = .1, the least number of questions n is 9 with
an advancement score of 8.
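
The search that underlies Example 8 can be sketched in a few lines. This is an illustrative reading of the procedure, not Wilcox's own program: for each test length n, the advancement score n0 is assumed to be the one that maximizes the smaller of the two correct-decision probabilities at the edges of the indifference zone, and n is increased until that value reaches P*. The function names are hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability of exactly x correct on n items."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def correct_decision_probs(n, n0, pi0, c):
    """Minimum correct-decision probabilities outside the indifference zone.

    alpha: examinee at pi0 - c scores below n0 (correctly retained).
    beta:  examinee at pi0 + c scores n0 or above (correctly advanced).
    """
    alpha = sum(binom_pmf(x, n, pi0 - c) for x in range(n0))
    beta = sum(binom_pmf(x, n, pi0 + c) for x in range(n0, n + 1))
    return alpha, beta

def best_cutoff(n, pi0, c):
    """Advancement score that maximizes min(alpha, beta) for a test of n items."""
    best_n0 = max(range(1, n + 1),
                  key=lambda n0: min(correct_decision_probs(n, n0, pi0, c)))
    return best_n0, min(correct_decision_probs(n, best_n0, pi0, c))

def shortest_test(pi0, c, p_star, max_n=200):
    """Smallest n (with its advancement score) whose minimum probability reaches p_star."""
    for n in range(1, max_n + 1):
        n0, p = best_cutoff(n, pi0, c)
        if p >= p_star:
            return n, n0, p
    return None  # no test of max_n items or fewer is long enough

# Example 8: pi0 = .80, c = .10, P* = .75.
print(shortest_test(0.80, 0.10, 0.75))
```

For the inputs of Example 8 the search returns n = 9 with an advancement score of 8 and a minimum probability of about .7748, which is the entry for n = 9, π0 = .80, c = .10 in Table 4.2.11.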




Table 4.2.11

Cut-off Scores and the Minimum Probability of a Correct Decision
for Values of π not in the Indifference Zone

           π0 = .70              π0 = .75              π0 = .80              π0 = .85
 n      c=.05     c=.10       c=.05     c=.10       c=.05     c=.10       c=.05     c=.10

 8     6/.5722   6/.6846     7/.5033   7/.6572     7/.6329   7/.7447     7/.4967   7/.6329
 9     7/.6007   7/.7382     7/.5372   7/.6627     8/.5995   8/.7748     8/.5683   8/.6997
10     8/.5256   8/.6778     8/.6172   8/.7384     9/.5443   9/.7361     9/.6242   9/.7560
11     8/.5744   8/.7037     9/.6174   9/.7788     9/.5448  10/.6974    10/.6779  10/.8029
12     9/.6488   9/.7747    10/.5583  10/.7358    10/.6093  10/.7472    11/.6590  11/.8416
13    10/.5843  10/.7473    10/.5794  11/.7296    11/.6674  11/.7975    12/.6213  12/.8646
14    10/.5733  10/.7207    11/.6488  11/.7795    12/.6479  12/.8392    13/.5846  13/.8470
15    11/.6481  11/.7827    12/.6482  12/.8227    13/.6042  13/.8159    13/.6020  14/.8290
16    12/.6302  12/.7982    13/.5981  13/.7899    13/.5950  14/.7892    14/.6482  15/.8108
17    12/.5803  13/.7582    13/.6113  13/.7652    14/.6470  14/.7981    15/.6904  15/.8363
18    13/.6450  13/.7912    14/.6673  14/.8114    15/.6943  15/.8354    16/.7287  16/.8647
19    14/.6678  14/.8369    15/.6733  15/.8500    16/.6841  16/.8668    17/.7054  17/.8887
20    14/.6172  15/.8042    16/.6296  16/.8298    17/.6477  17/.8670    18/.6769  18/.9087

(Reproduced from Wilcox, 1976, permission pending.)



Example 9:

Suppose π0 = .7 and c = .05.  On a 15-item test with an advancement
score of 11, what is the probability of correctly classifying an individual
with a domain score ≥ .75 (= π0 + c) or ≤ .65 (= π0 - c)?

Answer:  65%

A.  Summary of the Procedure

Because of the complexity of this formulation, a summary would appear
helpful.  One wants to minimize the probabilities of incorrectly classifying
individuals based upon a test score, i.e., one wants to minimize the probability
of a false positive error (α) and the probability of a false negative error (β)
simultaneously.  Since α = 1 - β, minimizing one will maximize the other.
The problem is circumvented by specifying an indifference zone, and then
α and β can be simultaneously minimized at the boundary points of the
indifference zone.  Wilcox has prepared tables that relate the number of test
items to α and β, for specific indifference zones, in terms of probability of
correct decisions outside the indifference zone.

B.  Comments

Consider next the following comments on the work of Fhaner and Wilcox:

1.  If c = 0, that is, there is no indifference region, it is not
    always possible to choose n such that the probability of a correct
    decision is at least P*.  Wilcox says that for this situation
    the probability of a correct decision approaches .5 (an
    unacceptable level) as n increases.  Hence, Millman's solution may
    not be adequate for certain situations.

2.  If the loss in misclassifying an individual who has obtained
    mastery (π ≥ π0 + c) is different from the loss in misclassifying
    a non-master (π ≤ π0 - c), then two numbers P1* and P2* can be chosen
    such that 1/2 < P1* < 1 and 1/2 < P2* < 1, and there is a smallest integer n
    so that α ≥ P1* and β ≥ P2*.

3.  If n is large, a theorem in statistics called the Central Limit
    Theorem justifies the use of the normal distribution in place
    of the binomial.  In this case, tables of the normal distribution
    function (Φ) may be used, and use of the Wilcox tables can be
    circumvented.  The number of test questions is then
    given by:


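\[ n = \left[ \frac{ z_{1-\alpha} \sqrt{(\pi_0 - c)(1 - \pi_0 + c)} + z_{1-\beta} \sqrt{(\pi_0 + c)(1 - \pi_0 - c)} }{2c} \right]^{2} \]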


    where   n = number of items
            c = positive constant (from before)
            π0 = cutting score
            z(1-α) = deviation score in a standardized normal distribution
                     corresponding to 1 - α
            z(1-β) = deviation score in a standardized normal distribution
                     corresponding to 1 - β.

    Fhaner notes that the normal approximation underestimates the
    number of items needed.  Wilcox notes that the procedure
    does not give you an optimal n0 (performance or advancement
    score).  Hence, a user needs to be careful when making use of
    the normal approximation.




4.2.6    Eignor-Hambleton Approach for Determining
Test Length¹

Methods for determining test length which depend upon the
minimization of classification errors were presented in the last three
sections.  An alternate approach in which test length is related directly
to several indices of test score reliability and validity is presented
in this section.  At this point it would be desirable for the reader to
study material presented in Unit 5 on test score reliability before
proceeding further.

A.  Introduction

A primary concern of individuals using test scores is that the scores
be both reliable and valid.  While the best approach for assessing test
score reliability and validity will depend on the particular situation, it
is well known that there is a relationship between the length of a test,
the advancement score, and the reliability and validity of the test
scores.  Longer tests result in test scores with better psychometric
properties.

For norm-referenced tests, the relationship of test length to
reliability can be expressed by the Spearman-Brown formula.  Also,
formulas exist that relate norm-referenced test length to test score
validity.  However, because these formulas are based upon a
correlational approach to reliability and validity, they are not very useful
with criterion-referenced tests when the intent of the
criterion-referenced test is to produce scores for making mastery/non-mastery
decisions (Hambleton, Swaminathan, Algina, & Coulson, 1978).  What is
often of interest to users of criterion-referenced tests is
information concerning the consistency of mastery/non-mastery decisions for
some group of examinees across a retest administration or across a
parallel-form administration.  Second, there is usually considerable
interest in the extent of agreement between mastery/non-mastery
decisions based on a criterion-referenced test and the "true" mastery
states of a group of examinees (sometimes called "decision accuracy").
(The "true" mastery state of an examinee is the one he/she should
be assigned to, based on the amount of knowledge or skill he/she
possesses relative to the objective or competency under investigation.)
The two situations described above correspond to one paradigm for
viewing the psychometric concepts of criterion-referenced test score
reliability and validity, respectively.

¹Material in this section is from a paper by Eignor and Hambleton
(1979).  Additional results are reported in Eignor (1979).

Hambleton et al. (1978) distinguished between two uses of
criterion-referenced test scores: domain score estimation and allocation of
examinees to mastery states.  For the first use, the test length
relationship to reliability can be derived, and may be summarized in the
well-known item sampling model (Lord and Novick, 1968).  It is for the
other major use of criterion-referenced test scores, mastery state
determination, that necessary technical developments are in short
supply.  Little research has been done that directly explores the
relationships of test length and cut-off scores to criterion-referenced
test score reliability and validity when the scores are used for
assigning examinees to mastery states.




What research has been done has focused either (1) on procedures
for determining reliability of examinee assignments to mastery states
(Hambleton & Novick, 1973; Swaminathan, Hambleton, & Algina, 1974;
Huynh, 1976; Subkoviak, 1976, 1978a, 1978b; Marshall & Haertel, 1976;
Algina & Noe, 1978) or (2) on procedures for the determination of test
length that minimizes misclassification errors (Millman, 1973; Novick
& Lewis, 1974; Fhaner, 1974; Wilcox, 1976, 1977).  The research reported
in this paper is directed toward linking together these two areas of
research and providing useful results to test practitioners to enable
them to determine test lengths to fit the situations in which their tests
will be used.

Specifically, the purpose of the study was two-fold:

1.  To report the relationships between test lengths and
    several reliability and validity indices for a fixed
    cut-off score (80%) in five domain score distributions.

2.  To report the relationships between advancement scores
    and several reliability and validity indices for several
    test lengths and in five domain score distributions.

The study was carried out using computer simulation methods.  The one
major advantage of this approach is that it is possible "to know"
examinee domain scores and their "true" mastery states.  Such
information permits one to compare examinee estimated domain scores and
assigned mastery states based on test results with domain scores and
true mastery states.  Summaries of such comparisons address the validity
of the particular set of test scores under investigation.




B.    Research Design

Terminology

Test length refers to the number of test items that are used to
measure examinee performance on a particular objective.  A domain
score for an examinee is defined as the proportion of items in the
domain of items measuring an objective that the individual can answer
correctly.  A cut-off score is set on the domain score scale [0,1]
to separate examinees into two true mastery states.

Since all items in the domain of items defined by an objective
cannot usually be administered to examinees for the purpose of
assessing their domain scores or assigning them to mastery states, a sample
of test items is chosen.  Estimated domain score is defined as the
proportion of items that an examinee answers correctly of the items
included in the test.  An advancement score is defined as the number
of items on the test measuring an objective deemed necessary for an
individual to answer correctly to be classified as a master.

In using an examinee's test score to determine his/her true
mastery status, two types of classification errors can result.  A
false-positive error occurs when an examinee is estimated to be a
master when his/her true status is non-master; a false-negative error
occurs when an examinee is estimated to be a non-master when his/her
true status is master.

Variables Under Study

(a)  Test Model

Both the binomial and compound binomial models were used to
simulate examinee item response data.  While criterion-referenced test
data has often been assumed to fit the binomial model, Lord (1965),
and more recently, Wilcox (1976, 1977), have suggested that the
compound binomial model may be appropriate.  The binomial model assumes
that the probability of a correct response for an examinee is the
same across all items on a test; or alternatively, that all items are
equally difficult (for that examinee).  The compound binomial model
assumes that the probability of correct response for an examinee varies
across items in a test, or that the items are not equally difficult
(for that examinee).  The latter assumption is considerably more
plausible, but investigations that have utilized both models (for instance,
Subkoviak, 1976) have demonstrated different, but not very much
different, results from the use of the two models.

(b)  Prior Distributions

For the binomial model, either a user-supplied or a beta prior
distribution on domain scores was specified and examinee domain scores
were sampled from this distribution.  For the user-supplied prior, a
percentage of examinees was assigned to each of ten equal intervals from
0.00 to 1.00, and a domain score distribution was constructed from this
information.  The percentages assigned to the intervals reflect the
prior belief about how the group would perform on the domain of tasks
of which the test is a sample.  An examinee's domain score was then
sampled from this prior distribution, and this domain score was used to
simulate binomial model test performance.  This process was then
repeated for 200 examinees.




When the prior domain score distribution was specified as a beta
prior, the fractile assessment procedure (Novick & Jackson, 1974) was
used to specify the parameters of the beta distribution, and then an
IMSL subroutine (GGBTA) was used to generate the distribution.  The
justification for using a beta prior distribution stems from two facts.
First, the beta distribution is defined on the interval [0,1] (whereas
most other distributions are not).  Second, the beta distribution allows
the user to easily generate skewed domain score distributions to
approximate distributions that might be expected to occur with real
criterion-referenced test data.

The fractile assessment procedure (FASP) has been offered by
Novick and Jackson (1974) as a means for specifying the parameters
of a beta distribution. The user is asked to specify $q_1$, $q_2$, and
$q_3$, the first, second (median), and third quartiles of the
distribution. The parameters, a and b, of the beta distribution are then
(approximately) given by:

$$a = c\,q_2 + \tfrac{1}{3} \quad \text{and} \quad b = c\,(1 - q_2) + \tfrac{1}{3}$$

where

$$c = .057 \left( \frac{1}{d_1^{2}} + \frac{1}{d_3^{2}} \right),$$

$$d_1 = [q_2 (1 - q_1)]^{1/2} - [q_1 (1 - q_2)]^{1/2},$$

and

$$d_3 = [q_3 (1 - q_2)]^{1/2} - [q_2 (1 - q_3)]^{1/2}.$$
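A short Python sketch of the fractile assessment calculation may help in checking hand computations. It assumes the approximation is read as a = c q2 + 1/3 and b = c (1 - q2) + 1/3, with c = .057 (1/d1^2 + 1/d3^2); the function name and the example quartiles are hypothetical, and Python's `random.betavariate` merely stands in for the IMSL GGBTA subroutine.

```python
import random
from math import sqrt


def beta_from_quartiles(q1, q2, q3):
    """Approximate beta parameters (a, b) from the three quartiles.

    Assumes the reading of the approximation stated in the lead-in;
    the squares on d1 and d3 are part of that assumption.
    """
    if not 0 < q1 < q2 < q3 < 1:
        raise ValueError("quartiles must satisfy 0 < q1 < q2 < q3 < 1")
    d1 = sqrt(q2 * (1 - q1)) - sqrt(q1 * (1 - q2))
    d3 = sqrt(q3 * (1 - q2)) - sqrt(q2 * (1 - q3))
    c = 0.057 * (1 / d1 ** 2 + 1 / d3 ** 2)
    a = c * q2 + 1 / 3
    b = c * (1 - q2) + 1 / 3
    return a, b


# Hypothetical prior belief: quartiles at .60, .70, and .80.
a, b = beta_from_quartiles(0.60, 0.70, 0.80)

# Beta-distributed domain scores for 200 simulated examinees.
rng = random.Random(0)
domain_scores = [rng.betavariate(a, b) for _ in range(200)]
```

Because the median quartile is above .50 in this example, a comes out larger than b and the resulting distribution is negatively skewed.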




The parameters a and b were then used as input to the GGBTA subroutine,
which generated beta-distributed domain scores. As with the
"user-supplied" prior distribution, domain scores were then used to simulate
examinee binomial model test performance.

Domain scores for use with the compound binomial model were generated
from a normal distribution (mean = 1, standard deviation = 1) and
then rescaled to the interval [0,1]. This step (and others done with
the compound binomial model) was carried out with the aid of the computer
program DATAGEN (Hambleton and Rovinelli, 1973), which in the past has
been regularly used to generate logistic test model data.

Additional  details  on  the  five  domain  score  distributions  used  in 
the  study  are  reported  in  Table  1. 

(c)  Advancement Scores

In  addressing  the  first  purpose  of  the  study,  advancement  scores 
were  always  set  exactly  equal  to  the  chosen  cut-off  score  of  .80.  This 
was  made  possible  because  of  the  test  lengths  under  consideration. 

In the second part of the study, for several fixed test lengths,
advancement scores were moved around with the same test data sets
to determine the influence of advancement score placement on indices
of test score reliability and validity.

(d)  Test  Lengths 

Test lengths of 5, 10, 15, 20, and 40 were considered in this
particular study. Many other test lengths (and advancement scores)
were considered by Eignor (1979).




Table 1

Descriptions of the Five Domain Score Distributions

Distribution  Test Model         Skewness           Domain Score Distribution Description
------------  -----------------  -----------------  -------------------------------------
1             Binomial           Moderate Negative  (a) Mode is slightly below the cut-off score (.80).
                                                    (b) Range of scores is [.11, 1.00].
                                                    (c) About 50% are on the interval [.60, .80] and
                                                        80% on the interval [.50, .90].

2             Binomial           High Negative      (a) Leptokurtic distribution with the mode above
                                                        the cut-off score (.80).
                                                    (b) Range of scores is [.60, 1.00] with about 80%
                                                        of the scores on the interval [.80, 1.00].

3             Binomial           Slight Negative    (a) Mode is far below the cut-off score.
                                                    (b) 50% on the interval [.00, .49] and 50% on the
                                                        interval [.50, 1.00].
                                                    (c) Substantial variation of scores.

4             Compound Binomial  Moderate Negative  (a) Mode is close to the cut-off score.
                                                    (b) Wide range of domain scores [.00, 1.00].
                                                    (c) 50% on the interval [.00, .79] and 50% on the
                                                        interval [.80, 1.00].
                                                    (d) Flatter distribution than either (1) or (2).

5             Compound Binomial  None               (a) An almost rectangular distribution on the
                                                        interval [.20, .90], with some domain scores,
                                                        but fewer of them, below .20 and above .90.




Reliability and Validity Indices

A  number  of  practical  indices  of  test  score  reliability  and 
validity  were  used  in  the  study.    The  two  diagrams  below  will  facili- 
tate  their  discussion. 


Diagram One

                          Test Occasion Two
                            NM        M
Test Occasion One    NM     p00       p01
                     M      p10       p11

Diagram Two

                          Criterion Measure
                            NM        M
Test Results         NM     p00       p01
                     M      p10       p11

(M = Mastery status; NM = Non-Mastery status)


The contingency table in Diagram One shows the proportion of
examinees falling in the four possible combinations of mastery state
assignments based on parallel-form (or test-retest) administrations
of a criterion-referenced test. The only difference in Diagram Two is
that a criterion measure is substituted for a parallel form of the
criterion-referenced test under study.

Two reliability indices are derivable from data reported in
Diagram One:

1.  Decision Consistency

    $$DC = \sum_{k=0}^{1} p_{kk}$$    (Hambleton & Novick, 1973)

2.  Kappa

    $$\kappa = \frac{DC - CA}{1 - CA}$$    (Swaminathan, Hambleton, & Algina, 1974)

    where CA (chance agreement) $= \sum_{k=0}^{1} p_{k\cdot}\, p_{\cdot k}$,
    and $p_{0\cdot}$, $p_{1\cdot}$ and $p_{\cdot 0}$, $p_{\cdot 1}$ are the
    respective marginal proportions for the first and second test
    administrations.
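The two reliability indices can be computed directly from the four proportions of Diagram One. The Python sketch below is illustrative only: the function name and the example proportions are hypothetical and are not taken from the study.

```python
def decision_consistency_and_kappa(table):
    """Return (DC, kappa) for a 2 x 2 table of proportions.

    table[j][k] is the proportion of examinees classified j on test
    occasion one and k on test occasion two (0 = non-master,
    1 = master).
    """
    dc = table[0][0] + table[1][1]                       # agreement on the diagonal
    row = [table[j][0] + table[j][1] for j in range(2)]  # occasion-one marginals
    col = [table[0][k] + table[1][k] for k in range(2)]  # occasion-two marginals
    ca = row[0] * col[0] + row[1] * col[1]               # chance agreement
    if ca == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    kappa = (dc - ca) / (1 - ca)
    return dc, kappa


# Hypothetical proportions for 200 examinees tested twice.
table = [[0.30, 0.10],
         [0.10, 0.50]]
dc, kappa = decision_consistency_and_kappa(table)
```

With these proportions, decision consistency is .80 while chance agreement is .52, so kappa is noticeably lower than the raw consistency figure.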

There are three validity indices derivable from Diagram Two:

3.  Decision Accuracy

    $$DA = \sum_{k=0}^{1} p_{kk}$$    (Hambleton and Novick, 1973)

4.  Predictive Validity (the Pearson correlation between decisions
    based on the criterion-referenced test and the criterion
    measure)

5.  Efficiency 

n  A  ; 

E  -   (Livingston,  1978) 

where $\pi_0$ is a cut-off score defined on the domain score scale,
$\hat{\pi}_0$ is an advancement score, $\pi_i$ is the domain score for
examinee i, and $\hat{\pi}_i$ is the estimated domain score for examinee i.

All of the statistics are well-known and commonly used in
criterion-referenced testing practice except for the last one (and this
is at least partially due to its newness). Essentially, efficiency is a
measure of how accurately a criterion-referenced test and associated
advancement score result in the assignment of examinees to mastery
states that are in agreement with decisions based on a criterion measure.
Also, the loss in efficiency due to misclassifying examinees
(false-positive and false-negative errors) is linearly related to the
difference between an examinee's level of performance on the criterion
test and the criterion test cut-off score. Clearly, Livingston's efficiency
does not directly address the validity of mastery classifications.
The index was included in the study because it provides an alternate
but potentially useful framework for viewing criterion-referenced
test score validity.
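Decision accuracy and predictive validity, as defined from Diagram Two, can be sketched in Python as follows. The Pearson correlation between two dichotomous decisions is computed here in its phi-coefficient form; the function name and the example proportions are hypothetical. Efficiency is left out because its formula is not reproduced here.

```python
from math import sqrt


def decision_accuracy_and_phi(table):
    """Return (DA, phi) for a 2 x 2 table of proportions.

    table[j][k] is the proportion of examinees classified j by the
    test and k by the criterion measure (0 = non-master, 1 = master).
    phi is the Pearson correlation between the two 0/1 decisions.
    """
    p00, p01 = table[0]
    p10, p11 = table[1]
    da = p00 + p11
    r0, r1 = p00 + p01, p10 + p11      # test-result marginals
    c0, c1 = p00 + p10, p01 + p11      # criterion marginals
    denom = sqrt(r0 * r1 * c0 * c1)
    # The correlation is undefined if everyone falls in one category.
    phi = (p00 * p11 - p01 * p10) / denom if denom > 0 else float("nan")
    return da, phi


# Hypothetical proportions.
table = [[0.30, 0.10],
         [0.10, 0.50]]
da, phi = decision_accuracy_and_phi(table)
```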




Data  Generation 


The process of generating examinee item scores, test scores, and
summary statistics on 200 examinees for various sets of testing
conditions was completed as follows:

1.  One of the domain score distributions from Table 1 and a
    test model (binomial or compound binomial) were selected.

2.  Examinee domain scores were generated, and examinees with
    domain scores equal to or above .80 were assigned to a
    mastery state on the criterion measure. All other examinees
    were assigned to a non-mastery state.

3.  For the particular test length under consideration, examinee
    domain score estimates were generated. For the binomial test
    model, this was done by setting the probability of a correct
    response for each item equal to the examinee's domain score.
    By generating random numbers uniformly distributed on the
    interval [0,1], it was possible to simulate the examinee's
    test item performance properly (i.e., answering each item
    with a probability of being correct equal to his/her domain
    score). Two sets of item scores were generated to simulate
    two test performances.

    For the compound binomial model, "item characteristic curves"
    were generated (see an example in Figure 1). From Figure 1
    it is clearly seen that the probability of a correct answer
    varies not only from one examinee to another but also for the
    same examinee from one item to another. Once the probabilities
    of answering the items correctly were obtained for a given
    examinee, item scores were obtained via the use of a random
    number generator.

4.  From the examinee item scores obtained in step 3, examinee
    test scores were obtained by summing the number of correctly
    answered test items.

5.  Each examinee was assigned to a mastery state based on a
    comparison of his/her estimated domain score and the
    advancement score. Two assignments were made, one for each test
    administered.

6.  The five summary statistics were calculated.

7.  Steps 1 to 6 were repeated for each of five domain score
    distributions and five test lengths (5, 10, 15, 20, and 40
    test items). In addition, for two test lengths (5 and 10),
    the summary statistics were calculated for three advancement
    scores: one at 80%, one below 80%, and one above 80%.
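The binomial-model branch of the steps above can be sketched as a small Python simulation. This is a minimal illustration, not the program used in the study: the function names, the seed values, and the beta(7, 3) domain score distribution are all hypothetical, and decision accuracy is computed here from the first administration only.

```python
import random


def simulate_binomial_test(domain_scores, n_items, advancement,
                           cutoff=0.80, seed=0):
    """Simulate two administrations of an n_items test per examinee.

    Returns one tuple per examinee:
    (master on criterion, master on test one, master on test two).
    """
    rng = random.Random(seed)
    records = []
    for pi in domain_scores:
        criterion_master = pi >= cutoff            # step 2
        decisions = []
        for _ in range(2):                         # two test performances
            # Step 3: each item is answered correctly with probability pi.
            correct = sum(rng.random() < pi for _ in range(n_items))
            # Steps 4-5: compare proportion correct with the advancement
            # score; the small tolerance guards against rounding error.
            decisions.append(correct >= advancement * n_items - 1e-9)
        records.append((criterion_master, decisions[0], decisions[1]))
    return records


def proportion_agree(pairs):
    """Proportion of (x, y) pairs in which the two decisions agree."""
    pairs = list(pairs)
    return sum(x == y for x, y in pairs) / len(pairs)


# Hypothetical negatively skewed domain score distribution, 200 examinees.
rng = random.Random(42)
scores = [rng.betavariate(7, 3) for _ in range(200)]

records = simulate_binomial_test(scores, n_items=10, advancement=0.80)
dc = proportion_agree((t1, t2) for _, t1, t2 in records)      # decision consistency
da = proportion_agree((crit, t1) for crit, t1, _ in records)  # decision accuracy
```

Repeating the run over several test lengths and advancement scores gives the kind of grid of results that step 7 describes.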


Figure 1.  Item Characteristic Curves of Four Test Items
and Probabilities of Correct Answers for Two Examinees
(horizontal axis: Domain Score Scale, .00 to 1.00)


C.  Results and Discussion

Effects of Test Length on Selected
Test Score Reliability and Validity Indices

Figures 2 to 6 provide the relationships between test length
and decision consistency, kappa, decision accuracy, predictive validity,
and efficiency, respectively, for each of the five domain score
distributions under consideration. In preparing the figures, statistical
data were available for each of the domain score distributions at
six test lengths: 0, 5, 10, 15, 20, and 40 items. Curves were drawn
to be monotonically increasing, non-intersecting, and as close fitting
to the data points as possible.

A number of observations and/or cautions concerning the use of
Figures 2 to 6 are offered next:

1.  Test score validity indices are lowest with homogeneous
    domain score distributions centered at or near a cut-off
    score. Domain score distribution one and (to a lesser
    extent) distribution two reflect this. The validity
    indices are highest for homogeneous domain score
    distributions where the center of the distribution is far from
    a cut-off score. These findings have several implications:

    (a)  Short tests can be used when there is reason to
         believe that a group of examinees will do either
         very well or very poorly on a particular test.
         (Of course, if the prior belief about the
         distribution of domain scores is highly inaccurate,
         test score reliability and validity indices will
         be considerably lower than those predicted from
         the figures.)

2.  Figures 2 to 6 apply to the case $\pi_0 = \hat{\pi}_0 = .80$. Such a
    situation is common in practice, but variations in cut-off
    scores and advancement scores from .80 will reduce the
    usefulness of the results reported in the figures.

3.  Details for using the figures in test development work
    will be offered later in the paper. It suffices to say
    here that the more important figures are those connecting
    test length to the validity indices. After an initial
    determination of test length has been made, Figures 2 and
    3 can be used to predict test score reliability. If it is
    not high enough to meet some specified standard, the test
    plan must be revised to lengthen the required test.

4.  Unfortunately, for many of the test lengths under
    consideration $\pi_0$ cannot equal $\hat{\pi}_0$ (for example, with
    $\pi_0 = .80$ and an eight-item test, $\hat{\pi}_0$ can be set equal
    to .75 or .875 but not .80). For test length and reliability
    results, the direction of errors in the figures will depend on
    the relation between $\hat{\pi}_0$ and the mean of the domain score
    distribution, $\bar{\pi}$. Decision consistency is monotonically
    related to the difference between $\hat{\pi}_0$ and $\bar{\pi}$. The
    bigger the difference, the higher the value of decision
    consistency will be. On the other hand, for kappa, the highest
    values are obtained when $\hat{\pi}_0$ and $\bar{\pi}$ are fairly close.

    For test length and validity results, the direction of
    errors appears to depend in a complicated way on the
    relations among $\pi_0$, $\hat{\pi}_0$, and $\bar{\pi}$. More will be
    said about this in a later section.

Figure 2.  Relationship Between Decision Consistency and
Test Length with Five Test Score Distributions
(horizontal axis: Number of Test Items, 0 to 40)

Figure 3.  Relationship Between Kappa and Test Length
with Five Test Score Distributions
(horizontal axis: Number of Test Items)

Figure 4.  Relationship Between Decision Accuracy and
Test Length with Five Test Score Distributions
(horizontal axis: Number of Test Items)

Figure 5.  Relationship Between Predictive Validity and
Test Length with Five Test Score Distributions
(horizontal axis: Number of Test Items)

Figure 6.  Relationship Between Efficiency and Test
Length with Five Test Score Distributions
(horizontal axis: Number of Test Items, 0 to 40)
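The eight-item example in observation 4 follows from the fact that an advancement score on an n-item test can only take the values 0/n, 1/n, ..., n/n. The Python sketch below lists the attainable values nearest a cut-off score; the function names are hypothetical.

```python
def attainable_advancement_scores(n_items):
    """All proportion-correct scores possible on an n_items test."""
    return [k / n_items for k in range(n_items + 1)]


def nearest_to_cutoff(n_items, cutoff=0.80):
    """Closest attainable scores at or below, and at or above, the cut-off."""
    scores = attainable_advancement_scores(n_items)
    below = max(s for s in scores if s <= cutoff)
    above = min(s for s in scores if s >= cutoff)
    return below, above


print(nearest_to_cutoff(8))    # (0.75, 0.875)
print(nearest_to_cutoff(10))   # (0.8, 0.8)
```

For test lengths that are multiples of five, such as those used in the study, the advancement score can be set exactly at .80.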


Effects of Advancement Score on Test Score
Reliability and Validity Indices

It was mentioned earlier that it is not always possible to set
a cut-off score and an advancement score equal to the same value.
Sometimes it is not even desirable to do so when the opportunity is
available. For example, if false-positive errors are considerably more
serious than false-negative errors, a test user may choose to set a
very high advancement score and thereby minimize the number of
false-positive errors. Such an action, however, will influence test score
reliability and validity in a complex way. In this section of the
paper a modest attempt is made to sort through a few of the
complexities. Data on the reliability and validity indices for two test
lengths, three advancement scores, and five domain score distributions
are reported in Table 2. A few comments may help to interpret the
results in the table. Note, however, that because of sampling errors,
not all of the results are consistent with the interpretations offered
below.



Table 2

Effect of Advancement Score on Several Reliability
and Validity Indices with Five Domain Score Distributions

                      Test     Advancement      Domain Score Distribution
Statistic             Length   Score            1      2      3      4      5
--------------------  ------   -----------    -----  -----  -----  -----  -----
Decision Consistency     5          3          .72    .93    .84    .76    .71
                         5          4          .64    .71    .76    .66    .71
                         5          5          .74    .55    .76    .70    .87
                        10          7          .73    .84    .80    .77    .73
                        10          8          .74    .74    [illegible]  .72  [illegible]
                        10          9          .77    .62    .84    .74    .89

Kappa                    5          3          .22    .08    .58    .32    .41
                         5          4          .28    .11    .49    .31    .35
                         5          5          .24    .10    .49    .29    .34
                        10          7          .47    .16    .51    .45    .30
                        10          8          .46    .28    .62    .43    .47
                        10          9          .33    .23    .67    .40    .40

Decision Accuracy        5          3          .43    .70    .72    .55    .62
                         5          4          .60    .74    .82    .68    .76
                         5          5          .83    .59    .82    .74    .88
                        10          7          .56    .80    .75    .74    .78
                        10          8          .6?    .77    .87    .77    .90
                        10          9          .83    .71    .89    .83    .95

Predictive Validity      5          3          .25    .09    .54    .36    .25
                         5          4          .29    .32    .65    .40    .43
                         5          5          .48    .22    .64    .48    .40
                        10          7          .31    .38    .56    .54    .33
                        10          8          .42    .51    .75    .55    .55
                        10          9          .50    .42    .78    .58    .39

Efficiency               5          3          .02    .51    .55    .15    .51
                         5          4          .45    .64    .78    .53    .75
                         5          5          .82    .39    .83    .75    .93
                        10          7          .43    .71    .81    .62    .79
                        10          8          .66    .70    .89    .74    .93
                        10          9          .83    .61    .93    .88    .97



Decision Consistency

It is very clear that as the advancement score moves away from
the center of a domain score distribution, decision consistency
increases. This explains why, for the 10-item test and distribution
five, decision consistency is lowest (.73) at $\hat{\pi}_0 = .70$ and highest
(.89) at $\hat{\pi}_0 = .90$. The mean of the distribution is in the region
of .60. The reverse result is obtained with distribution two.
The highest value (.84) is obtained at $\hat{\pi}_0 = .70$ and the lowest
value (.62) is obtained at $\hat{\pi}_0 = .90$. The mean of distribution
two is about .90. Since the mean of distribution four is close
to .80, it is not surprising to observe the lowest value (.72)
at $\hat{\pi}_0 = .80$ and higher values at $\hat{\pi}_0 = .70$ (.77) and at
$\hat{\pi}_0 = .90$ (.74).


Kappa

While the results are not too clear cut, it does appear that the
highest values of kappa are obtained when an advancement score
is near the middle of a domain score distribution. Huynh (1976)
noted a similar finding in some of his work.


Decision Accuracy

Somewhat surprisingly, the value of decision accuracy is
monotonically related to the distance between $\hat{\pi}_0$ and $\bar{\pi}$. The
role that $\pi_0$ plays in the tabulated results is not readily
apparent from the reported results.


Predictive Validity

There do not appear to be any trends in the results.


Efficiency

The results here are identical to those reported for decision
accuracy and the explanation is the same.


Using  the  Results  to  Determine  Test  Length 

Many factors will have an influence on the test length which is
finally selected:

1.  The shape (essentially variability) of the domain score
    distribution (regardless of which statistic is chosen, it
    is clear from Figures 2 to 6 that the variability of the
    domain score distribution has a considerable influence
    on the results). In general, higher indices are obtained
    with heterogeneous domain score distributions.

2.  The placement of cut-off scores (in general, higher
    validity indices are obtained if $\pi_0$ and $\bar{\pi}$ are not too
    close).

3.  The selection of advancement scores (has a complicated
    relationship to test length).

4.  The desired level of one of the reliability and/or validity
    indices (the higher the desired value, the longer the
    required test must be).

Six steps are offered next for determining test length in
particular testing situations:

1.  Select a primary statistic of interest (this is usually
    "decision accuracy").

2.  Set a cut-off score (if $\pi_0 = .80$, proceed through the
    remaining steps; if $\pi_0 \neq .80$, it will be necessary to
    generate additional results using the method described
    in the last section of this paper).

3.  Set advancement scores corresponding to test lengths
    under consideration which are near .80 (if $\hat{\pi}_0 = .80$,
    Figures 2 to 6 will provide usable results).

4.  Specify a prior belief about the domain score distribution
    for the group of examinees who will be assessed. If
    conservative results are desired, it is best to work with
    homogeneous distributions centered around $\pi_0 = .80$.

5.  Choose (a) or (b):

    (a)  With the statistic identified in step 1, and a desired
         value for the statistic, find the correct figure and
         read off the corresponding test length from the curve
         corresponding to the domain score distribution selected
         in step 4.

         For example, suppose a test developer desired a
         decision accuracy statistic equal to .80 and the
         most likely domain score distribution is number 1.
         From Figure 4, the corresponding test length is 21
         items.

    (b)  With the reliability or validity statistic selected in
         step 1, and several test lengths of interest, find the
         corresponding values of the desired statistic for the
         test lengths of interest. Select the test length
         which seems suitable.

6.  Check "decision consistency" and/or "kappa" for the test
    length selected in step 5. (With the example in 5a
    above, the value is .75 for decision consistency.) If
    the value is too low for the intended purpose of the test,
    determine a value which is not, read off the corresponding
    test length, and then repeat step 5a or 5b again.




The  values  provided  in  the  figures  are  only  approximations. 
Still,  they  should  be  helpful  to  test  developers  who  aspire  to  set 
their  test  lengths  in  a  way  which  is  not  totally  dependent  on  guess 
work. 

D.  Suggestions  for  Further  Research  and  Development 
Because of (1) the considerable importance of the topics under
study in this paper, and (2) the paucity of practical research results,
it is easy to suggest many directions for further work. For one, a
computer program is needed into which a test developer can (a) provide
a prior belief about the shape of a domain score distribution
for some group of examinees to be tested, (b) select a test model
(probably the binomial or the compound binomial), (c) select one or
more reliability and validity indices of interest, and (d) select
test lengths and advancement scores of interest. The output from
the computer program would provide a basis for determining test length.

One of the spin-offs from this simulation study is the
availability of a computer program that has some of the features mentioned
above. It can be used by test practitioners to generate additional
results to those reported in the paper. Practitioners need only
(1) specify a prior belief about the distribution of domain scores,
(2) suggest test lengths, cut-off scores, and advancement scores, and
(3) select either the binomial or compound binomial test model from
which to simulate examinee item response data. Figures similar to
those reported in this study can be quickly obtained. A write-up of
the current computer program is in preparation and will be available
soon. One drawback is that it is not as easy a system to use, nor
does it have as many features, as might be desirable.

A second area for further work is in the area of "guidelines
for interpreting the reliability and validity indices." In the area
of norm-referenced testing, even with a plethora of textbooks available
and the training many people have had, there is still considerable
confusion about the correct interpretations of reliability and validity
indices. Because of the newness of the five statistics used in this
study, it seems clear that if they are to have any value at all, increased
effort must be given to training test developers in the use of these
and other relevant statistics.

Third, the validity of the relationships reported in Figures 2
to 6 among test length, cut-off scores, advancement scores, domain
score distributions, and five reliability and validity indices should
be compared to existing results reported on real test data. In a very
limited way, some of the theoretical results reported in this paper
were compared to results obtained from real data. The differences were
very small, but considerably more work of this general type should be
done. The reliability results would be particularly easy to check.
Only the examinee responses to large sets of test items keyed to
objectives would be required. "Tests" of varying lengths could be
drawn from the examinee-item pool of data keyed to a particular
objective, "parallel forms" constructed, and various advancement scores
considered. Via the method of sampling of examinees, assuming the
"pool" of examinees was heterogeneous and large enough, nearly any
domain score distribution could be studied as well.



4.2.7    Method of Selecting a Procedure
For Determining Test Length


Answers to the three questions below will provide a basis for
selecting one of the four methods of determining test length.


Question 1:  Has a cut-off score been determined using a suitable
method?

    Directive:  Set a cut-off score using an acceptable method.

Question 2:  Do you need a quick estimate of test length or more
extensive justification?

    Directive:  Use Millman's Tables.

Question 3:  Do you have or can you specify prior information about
the domain score distribution?

    Directive:  Use Novick and Lewis' Bayesian Procedures.

    Directive:  Use Fhaner and Wilcox Indifference Zone Approach.

    OR

    Directive:  If cut-off score = .80, use the Eignor-Hambleton
    figures.

    OR

    Directive:  Use Table 4.2.3 in this unit developed by
    Novick & Lewis (Uniform Prior).



4.3    Test  Item  Selection 

The  item  selection  process  is  quite  simple  provided  the  criterion- 
referenced  test  constructor    has  been  careful  in  defining  the  domain 
of  concern  and  in  constructing  test  items  (see  Unit  2).    That  is,  the 
test  developer  has  to  have  been  careful  to  define  the  size  of  his/her 
domain  to  be  consonant  with  the  test's  purpose.    If  the  purpose  of 
testing  is  to  make  major  level  decisions  on,  for  instance  the  school 
level,  a  large  domain  size  can  be  tolerated.    If,  however,  the  purpose 
of  testing  is  to  provide  information  for  remedial  instruction,  a  smaller 
domain  size  is  needed.    Popham  (1978)  has  offered  some  suggestions  for 
ascertaining  domain  size.    The  critical  point  for  item  selection  is 
that  the  domain  be  a  reasonable  size  so  that  proper  sampling  from  the 
domain  can  occur.    If  the  domain  is  so  large  that  it  is  difficult  to 
see  how  to  generate  a  set  of  items  from  the  domain  for  the  test,  then 
the  domain  must  be  broken  up  into  sub-domains  and  items  generated  for 
those  sub-domains.    The  sampling  process  should  be  clear  for  these  sub- 
domains.    Thus,  it  is  critical  that  the  domain  be  of  a  size  that  a  set 
of  items  can  be  clearly  constructed  from  the  domain,  and  then  the  sampling 
process  can  be  carried  out  without  complications. 

Having  defined  a  domain  size  that  is  manageable  for  sampling 
is  not  enough;  the  test  developer  must  also  be  careful  to  ascertain 
that  all  the  items  constructed  for  the  domain  do  indeed  "tap"  the  be- 
havior specified.    The  items  must  adhere  to  the  restrictions  imposed 
on  the  domain  specifications. 

If  the  size  of  the  domain  is  manageable  for  the  sampling  process 
and  the  test  developer  is  sure  that  the  items  generated  "tap"  the 
specified  behavior,  then  the  item  selection  process  is  quite  simple. 



The test is constructed by taking either a random or stratified random
sample of items from the domain. It should be noted that if the domain
has been explicitly defined (see section 2.1 of Unit 2), then a
random sample of items can be taken. If the domain has to be defined
implicitly, as is the case with domain specifications, then only a
representative set of items defining the domain has been generated,
and a random sample is drawn from that set for the test. That is really
a technical distinction referring to the domain; in either case, the
items should (in theory) be selected randomly for the test.

A word of caution should be presented at this point. Unlike the
procedures for norm-referenced tests, statistical indices should not
be used in the item selection process. Item difficulty and item
discrimination are not useful in the item selection process; these indices
may be useful in helping to detect flawed items in the item validation
stage (see Unit 3). According to Millman (1974):

Selection of items on these criteria can result in
a test where the items are not representative of
the domain in difficulty level or in the underlying
attributes being measured. An examinee's status
relative to a well-defined domain can best be
gleaned from the examinee's responses to a
representative sample of items from the item population.
Items chosen by empirical means are likely to be
average in difficulty and more homogeneous than is
true for all the items. The use of item statistics
destroys the random selection process, a defining
characteristic of [criterion-referenced tests].
Unless items are selected randomly, the estimate of
a person's domain score loses meaning and
the interpretability of the test score is reduced.


In sum, items should be selected by random sampling from the
complete set of items generated for domains defined explicitly or from
the representative set of items generated for domains defined implicitly.
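The random selection step summarized above can be sketched in a few lines of Python. This is only an illustration: the function name `draw_test_items`, the item labels, and the seed value are assumptions introduced here, not part of the original materials.

```python
import random


def draw_test_items(item_pool, n_items, seed=None):
    """Return a simple random sample (without replacement) of n_items.

    item_pool is the complete (or representative) set of items written
    for one domain specification; seed makes the draw repeatable.
    """
    if n_items > len(item_pool):
        raise ValueError("cannot draw more items than the pool contains")
    rng = random.Random(seed)
    return rng.sample(list(item_pool), n_items)


# Hypothetical pool of 50 items written for one domain specification.
pool = [f"item_{i:03d}" for i in range(1, 51)]
test_form = draw_test_items(pool, n_items=10, seed=1979)
print(test_form)
```

Fixing the seed is a convenience for documentation purposes; a second form drawn with a different seed would be another random sample from the same pool.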


One advantage of choosing representative sets of test items is
that examinee test scores (or proportion-correct scores) provide
"unbiased" estimates of their domain scores. It is possible also to
set  standards  and  interpret  examinee  test  performance  relative  to  those 
standards.    Unfortunately,  when  the  number  of  test  items  is  small  (as 
is  frequently  the  case),  the  consistency  of  decisions  (competent/ 
incompetent)  across  a  retest  administration  or  across  a  parallel-form 
administration  of  a  test  may  be  distressingly  low.    Increasing  the 
number  of  test  items  measuring  each  objective  is  helpful  but  often  it 
is  not  feasible  to  do  so.    One  answer  to  the  dilemma  is  as  follows: 
When the primary purpose of the testing program is to make dichotomous
decisions about examinees, a more effective test can be produced if test
items from the available pool of test items measuring each objective
are selected based on their statistical properties. Specifically, if
(say)  a  standard  is  set  at  80%,  it  would  be  best  to  select  test  items 
which  have  p-values  (item  difficulty  levels)  in  the  region  of  .80  and 
which  have  the  highest  discrimination  indices.    A  test  constructed  in 
this  way  will  have  maximum  discriminating  power  in  the  region  where 
decisions  are  being  made  and  therefore  more  reliable  and  valid  decisions 
will  result.    One  possible  drawback  is  that  scores  derived  from  the  test 
cannot  be  used  to  make  descriptive  statements  about  examinee  levels  of 
performance  on  the  objectives  measured  in  the  test.    This  is  because  test 
items  measuring  each  objective  will  not  usually  constitute  a  representative 
sample. In theory, there is at least one way to make descriptive
statements about examinee levels of performance on the objectives measured
by a test when non-random or non-representative samples of test items
are chosen. It can be done by introducing concepts and models from
the field of latent trait theory. The feasibility, however, of such
an approach has not been tested.
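For the decision-oriented case described above (a standard of 80%, items with p-values near .80, highest discrimination indices first), a minimal Python sketch follows. The dictionary keys `p_value` and `discrimination`, the `tolerance` band of .10 around the standard, and the item data are all assumptions made for illustration; the text does not say how close to the standard a p-value must be.

```python
def select_items_near_cutoff(items, cutoff, n_items, tolerance=0.10):
    """Pick the n_items most discriminating items whose p-values lie
    within `tolerance` of the standard (cutoff).

    items: list of dicts with keys 'id', 'p_value', 'discrimination'.
    """
    candidates = [it for it in items if abs(it["p_value"] - cutoff) <= tolerance]
    ranked = sorted(candidates, key=lambda it: it["discrimination"], reverse=True)
    return ranked[:n_items]


# Hypothetical item statistics for one objective.
items = [
    {"id": "A", "p_value": 0.82, "discrimination": 0.45},
    {"id": "B", "p_value": 0.55, "discrimination": 0.60},
    {"id": "C", "p_value": 0.78, "discrimination": 0.30},
    {"id": "D", "p_value": 0.85, "discrimination": 0.50},
    {"id": "E", "p_value": 0.95, "discrimination": 0.20},
    {"id": "F", "p_value": 0.75, "discrimination": 0.40},
]

chosen = select_items_near_cutoff(items, cutoff=0.80, n_items=3)
print([it["id"] for it in chosen])
```

Item B has the highest discrimination index but is excluded because its p-value is far from the standard. As the text cautions, scores from a test built this way should not be read as descriptive estimates of domain performance.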




4.3.1    Post  Item  Selection  Checklist 

In Unit 2 of these materials a set of checklists was offered
that  should  be  useful  at  the  item  writing  stage  of  the  criterion- 
referenced  test  development  process.    Most  of  the  questions  posed  in 
those  checklists  can  be  answered  after  the  items  are  written;  there 
are,  however,  a  number  of  questions  that  can  only  be  answered  after 
the  test  items  have  been  selected  and  organized  into  a  test.  The 
checklist  that  follows  presents  the  questions  that  are  appropriate 
to  ask  after  the  test  items  have  been  selected  and  assembled  in  a  test. 




3/15/79 


Test  Directions  and  Item  Selection 


Review  Form 


Domain Specification:


Reviewer:


Date:


Test  Directions 

1.  Do  the  directions  indicate  the  test's 
purpose? 

2. Do the directions indicate how the
test items will be scored?

3. Do the directions indicate how
examinees are to "mark" their
answers (on the test booklet or a
separate answer sheet)?

4. Are there any practice test items?

5. Do the directions indicate the time
allowed to complete the test items?


Yes    No    Unsure


✓


✓




Test Items

6. Do the test items represent at least an
adequate sample from the domain of items
defined by the domain specification?

7. Do any of the test items contain clues
which may help examinees answer other
test items measuring the domain
specification?

8.  Will  examinees  learn  anything  from  one 
or  more  test  items  which  will  help  them 
answer  other  test  items? 

9.  Have  the  items  been  checked  with  content 
and  measurement  specialists  to  try  and 
eliminate  ambiguity,  technical  errors, 
and  other  errors  in  item  writing? 

10.  Has  the  number  of  item  formats  been 
kept  to  a  minimum? 


✓


✓ 


✓ 


11. Were the most "valid" item formats used?

12.  Were  items  in  the  same  format  grouped 
together? 

13.  Do  the  correct  answers  follow  essentially 
a  random  pattern? 

Multiple-Choice 

14.  Has  the  number  of  negatively  stated 
item  stems  been  kept  to  a  minimum  (less 
than  10%)? 

True-False 

15. Are the true statements of
the same length as the false
statements?


✓ indicates the desired response.




4.4    Preparation  of  Directions 

In this and subsequent sections of this unit, the procedures
to be described for criterion-referenced test development are
essentially the same as those for norm-referenced test development. Because such
procedures are well documented, what follows are some helpful hints
for the reader, along with a listing of references that may be
consulted for a more in-depth discussion.

Payne (1974) has presented seven criteria, taken from the
Traxler (1951) paper, that should be kept in mind when writing
test directions. These are:

1.  Assume  that  the  examinees  and  examiner  know  nothing  at 
all  about  objective  tests. 

2.  In  writing  the  directions,  use  a  clear,  succinct  style. 
Be  as  explicit  as  possible,  but  avoid  long  drawn-out 
explanations. 

3. Emphasize the more important directions and key activities
through the use of underlining, italics, or different
type size or style.

4. Give the examiner and each proctor full instructions
on what is to be done before, during, and after the
administration.

5. Field or pretest the directions with a sample of both
examinees and examiners to identify possible misunderstanding
and inconsistencies and gather suggestions for
improvement.

6.  Keep  the  directions  for  different  forms,  subsections, 
or  booklets  as  uniform  as  possible. 

7.  Where  necessary  or  helpful,  give  practice  items  before 
each  regular  section. 




Gronlund  (1976)  states  that  while  the  directions  should  be  as 
simple  and  concise  as  possible,  they  must  contain  information  on  each 
of  the  following: 


1.  The  purpose  of  the  test. 

2.  The  time  allowed  to  complete  the  test. 

3. How answers should be recorded (on the test itself
or a separate answer sheet).

4. Whether or not to guess when in doubt about the
answer.


Gronlund (1976) has an excellent discussion of these four areas of
concern. Ahmann and Glock (1975) also have a good discussion of the
preparation of directions.

In reference to Gronlund's fourth point, about guessing on criterion-
referenced tests, some helpful comments can be made at this point, both
about the guessing itself and about whether or not to use correction-for-
guessing formulas. First of all, it is unlikely that in a criterion-
referenced testing context a student would be guessing blindly at an
answer. When these tests are used in instructional settings, such as
after a unit of study, the student is likely to have partial knowledge
about an answer even if he/she does not know the answer. The guideline is
that if the student can eliminate any of the response options on a test
question, he/she should be encouraged to attempt the question using
the smaller option set.

Correction-for-guessing formulas have been utilized in norm-
referenced testing situations because of the concern that the proper
rank-ordering of students based on test results may be upset by
the predisposition of certain students to guess randomly at questions,
and of other examinees to omit questions even when they are reasonably
certain of their answers. Further, it is known that if all examinees
have sufficient time to answer all items, there is no difference
in the rankings of students on corrected scores as
compared with uncorrected scores. We feel that correction-for-
guessing formulas are suitable only for the norm-referenced context,
where rank-ordering is the concern. Further, it would make little
sense to use them for criterion-referenced tests because with these
tests students are usually given sufficient time to complete the
questions.
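The point about rankings can be illustrated with a short Python sketch. The text does not state a particular formula; the sketch assumes the conventional correction, rights minus wrongs divided by (number of options minus one), and the examinee names and scores are hypothetical. When every examinee answers every item, the corrected score is an increasing linear function of the number right, so the rank order cannot change.

```python
def corrected_score(num_right, num_wrong, options_per_item):
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items are counted as neither right nor wrong.
    """
    if options_per_item < 2:
        raise ValueError("items need at least two response options")
    return num_right - num_wrong / (options_per_item - 1)


def rank_order(scores):
    """Return examinee names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


n_items = 20          # every examinee answers all 20 items
k = 4                 # four response options per item
rights = {"Ann": 18, "Ben": 15, "Cal": 11, "Dee": 7}

corrected = {
    name: corrected_score(r, n_items - r, k) for name, r in rights.items()
}

print(rank_order(rights))
print(rank_order(corrected))
```

Rankings can differ only when examinees omit different numbers of items, which is the norm-referenced concern described above.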




4.5    Layout  and  Test  Booklet  Preparation 

In assembling the items into a test, a decision must be made
concerning the best item arrangement. There are two possible ways
of organizing a set of items in a criterion-referenced test:


1.  If  there  are  multiple  item  types,  the  items  should 
be  arranged  so  that  all  items  of  the  same  type  are 
grouped  together. 

2. For many purposes, it may be desirable to group together
items that measure the same objective or
domain generated from a domain specification.


Only in certain situations can both arrangements be applied
simultaneously. Usually one method of organization will be chosen over
the other, and the choice will depend upon the purpose for testing. For
instance, if the test is being used to diagnose problems for subsequent
assignment of students to remedial activities, the test developer would
probably want to group items tapping the same objective together. This
would give an immediate indication of those objectives the student is
having difficulty with. Further, it may be possible to organize by
item type within objective if there are a large number of test items
per objective. In sum, the choice of method of organization will depend
upon the purpose for testing.

Gronlund (1976) suggests that if the organization is by item
type, the following order should be used, because certain item types
are more difficult than others and the simpler activities should come
first:


1.  True-false  items 

2.  Matching  items 

3.  Short-answer  items 

4.  Multiple-choice  items 

5.  Essay  questions 




He further suggests that organization by item type should always be
considered first, and that only in certain situations should alternate
organizational schemes be considered. According to Gronlund
(1976):

This arrangement provides for the fewest sets
of directions; it is easier for the pupils
since they can retain the same mental set
throughout each section; and it greatly
facilitates scoring.
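The two arrangements discussed in this section (by item type in Gronlund's suggested order, or by objective with item type ordered within objective) can be sketched as follows. The item records, their `type` and `objective` fields, and the function names are hypothetical and serve only to illustrate the ordering.

```python
# Order of item types suggested by Gronlund (1976), simplest first.
ITEM_TYPE_ORDER = [
    "true-false",
    "matching",
    "short-answer",
    "multiple-choice",
    "essay",
]
TYPE_RANK = {item_type: rank for rank, item_type in enumerate(ITEM_TYPE_ORDER)}


def arrange_by_type(items):
    """Group items of the same type together, in the suggested order."""
    return sorted(items, key=lambda it: TYPE_RANK[it["type"]])


def arrange_by_objective(items):
    """Group items by objective, ordering by item type within objective."""
    return sorted(items, key=lambda it: (it["objective"], TYPE_RANK[it["type"]]))


items = [
    {"id": 1, "type": "multiple-choice", "objective": "A"},
    {"id": 2, "type": "true-false", "objective": "B"},
    {"id": 3, "type": "essay", "objective": "A"},
    {"id": 4, "type": "true-false", "objective": "A"},
    {"id": 5, "type": "matching", "objective": "B"},
]

print([it["id"] for it in arrange_by_type(items)])
print([it["id"] for it in arrange_by_objective(items)])
```

Python's `sorted` is stable, so items that tie on the sort key keep the order in which they were written.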

The following guidelines for test booklet preparation
are relevant for teacher-prepared tests. These points have been
synthesized from Gronlund (1976) and Noll and Scannell (1972). An in-depth
discussion of procedures for preparation and reproduction of the test
can be found in an article by Thorndike (1971). This is the most
recent in-depth discussion of these procedures that the authors have
seen.


In  these  materials,  the  following  useful  guidelines  are  offered: 

1. Make sure that test items are spaced so that they can be read,
answered, and scored with the least amount of difficulty.
Double space between items.

2.  Make  sure  all  items  have  generous  borders. 

3.  Multiple-choice  items  should  have  the  alternatives  listed 
vertically  beneath  the  stem. 

4. Do not split an item onto two separate pages.

5. With interpretation exercises, place the introduction on a
facing page with all items referring to it on a single page.

6. If not using an answer sheet, the space for answering should
be down the left side of the page.

7. The most convenient method of response is circling correct
answers.

8. Test items should be numbered consecutively throughout the test.

9.  Tests  reproduced  by  processes  available  to  school  systems 
should  be  duplicated  on  one  side  of  the  sheet  only. 

10. If a separate answer sheet is used, test booklets can be reused.
They  should  be  numbered  so  a  check  can  be  made  for  a  complete 
set  of  materials  after  test  administration. 




4.6    Preparation  of  Scoring  Keys 

If a standard, commercial answer sheet is used, either the answer
sheets can be scored by machine or a punch-out overlay template can be
used in scoring. If a hand-scoring answer key is to be used, Payne
(1974), based on the Traxler (1951) article, describes three varieties
of hand-scoring keys that can be useful: the fan or accordion,
strip, and cut-out keys. What follows is a brief description of each.
The descriptions are taken directly from Payne (1974):

Fan Key: This key consists of a series of columns, extending
from the top to the bottom of the page, on which are recorded acceptable
answers or directions scored for the individual items. The key and the
answer sheet are the same size and identically spaced. Usually each
column corresponds to a page of the test. The key is folded along vertical
lines separating its columns and is superimposed on the appropriate
page of the test or next to the appropriate column of the answer sheet
and matched to the corresponding responses.

Strip  Key:    Similar  to  the  fan  key,  this  method  employs  the  use  of 
separate  columns,  usually  on  cardboard. 

Cut-Out Key: Windows are cut out to reveal letters, numbers,
words, or phrases on the answer sheet. The key is superimposed on a
page of the test or answer sheet.

Gronlund (1976) offers some helpful hints that can be used in the
actual scoring process. A most useful hint is to draw a red line through
the correct answers of items missed rather than through the wrong answers.
This indicates to the student which items he/she missed and at the same
time indicates the correct answer.
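Where responses are recorded electronically rather than scored by hand, the same idea (report the correct answer for each missed item) can be sketched in Python. The function name, the answer key, and the response data are hypothetical; treating an omitted item as missed is an assumption of this sketch.

```python
def score_with_key(responses, key):
    """Score one examinee's responses against an answer key.

    Returns the number right and a dict mapping each missed item number
    to its correct answer, so the feedback shows what the answer should
    have been rather than only marking the wrong response.
    """
    if len(responses) != len(key):
        raise ValueError("responses and key must have the same length")
    num_right = 0
    missed = {}
    for item_number, (given, correct) in enumerate(zip(responses, key), start=1):
        if given == correct:
            num_right += 1
        else:
            missed[item_number] = correct  # omitted items (None) land here too
    return num_right, missed


key = ["B", "D", "A", "C", "A"]
responses = ["B", "C", "A", "C", None]   # item 5 was omitted

num_right, missed = score_with_key(responses, key)
print(num_right, missed)
```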




4.7    Preparation  of  Answer  Sheets 

If a teacher-prepared answer sheet is to be used, the following
simple guidelines may be helpful:

1. Make sure that the numbers on the items correspond
with the numbers on the answer sheet.

2.  Number  the  items  on  the  answer  sheet  consecutively 
down  the  pages  rather  than  across. 

3.  Make  all  lines  for  answers  exactly  the  same  length. 


If a commercially prepared answer sheet is to be used, the following
suggestions may be helpful:

1. Make sure that the answer sheet does not have more
response options than the test.

2.  Try  to  obtain  answer  sheets  that  have  answer  spaces 
running  down  a  column  of  the  answer  sheet  rather 
than  across.    If  the  answer  spaces  run  across,  make 
sure  to  notify  the  students. 

3. Try to purchase an answer sheet that has approximately
the same number of answer spaces as questions
on the test.




4.8    Test  Administration 

An excellent discussion of factors of concern in the test
administration process is contained in an article by Clemans (1971). What
follows is material discussed in the Payne (1974) book and in Gronlund
(1976).

In order to ensure optimal conditions, so that test scores
can have meaning, Prescott has prepared the following set of guidelines
for administration before, during, and after the test (taken from
Payne, 1974). These guidelines are relevant for both standardized and
classroom tests.

Before  the  Testing  Date 

1.  Understand  nature  and  purposes  of  the  testing: 

a.  Tests  to  be  given. 

b.  Reasons  for  giving  tests. 

2. Decide on number to be tested at one time.

3. Decide on seating arrangements.

4.  Decide  on  exact  time  of  testing. 

a.  Avoid  day  before  holiday. 

b.  Avoid  conflicts  with  recess  of  other  groups. 

c.  Make  sure  there  is  ample  time. 

5.  Procure  and  check  test  materials: 

a.  Directions  for  administering. 

b.  Directions  for  scoring. 

c.  Test  booklets: 

(1)  One  for  each  pupil  and  examiner. 

d.  Answer  sheets: 

(1)  One  for  each  pupil  and  examiner. 

e.  Pencils  (regular  or  special). 

f.  Stopwatch  or  other  suitable  timer. 

g.  Scoring  keys. 

h.  "Testing—Do  Not  Disturb"  sign. 

i.  Other  supplies  (scratch  paper,  etc.). 

6.  Study  test  and  directions  carefully. 

a.  Familiarize  yourself  with: 

(1)  General  make-up  of  test. 

(2)  Time  limits. 

(3)  Directions. 

(4)  Method  of  indicating  answers. 

b.  Take  the  test  yourself. 




7. Arrange materials for distribution.
a. Count number needed.

8. Decide on order in which materials are to be distributed
and collected.

9. Decide what pupils who finish early are to do.

Just  Before  Testing 

1. Make sure central loudspeaker is disconnected.

2.  Put  up  "Testing— Do  Not  Disturb"  sign. 

3.  See  that  desks  are  cleared. 

4.  See  that  pupils  have  sharpened  pencils. 

5.  Attend  to  toilet  needs  of  pupils. 

6. Check lighting.

7.  Check  ventilation. 

8.  Make  seating  arrangements. 

During  Testing 

1.  Distribute  materials  according  to  predetermined  order. 

2.  Caution  pupils  not  to  begin  until  you  tell  them  to  do  so. 

3. Make sure that all identifying information is written
on booklet or answer sheet.

4. Read directions exactly as given.

5. Give signal to start.

6. Write starting and finishing times on the chalkboard.

7. Move quietly about the room to:

a. Make sure pupils are marking answers in the correct place.

b.  Make  sure  pupils  are  continuing  to  the  next  page  after 
finishing  the  previous  page. 

c.  Make  sure  pupils  stop  at  the  end  of  the  test. 

d.  Replace  broken  pencils. 

e.  Encourage  pupils  to  keep  working  until  time  is  called. 

f.  Make  sure  there  is  no  copying. 

g.  Attend  to  pupils  finishing  early. 

8.  Permit  no  outside  interruptions. 

9.  Stop  at  the  proper  time. 

Just  After  Testing 

1.  Collect  materials  according  to  predetermined  order. 

2.  Count  booklets  and  answer  sheets. 

3.  Make  a  record  of  any  incidents  observed  that  may  tend 
to  invalidate  scores  made  by  pupils. 




In addition to these guidelines, Gronlund (1976) offers the
following four suggestions about activities to avoid when administering
the test:


1. Do not talk unnecessarily before the test.

2.  Keep  interruptions  during  the  test  to  a  minimum. 

3.  Avoid  giving  hints  to  pupils  who  ask  about  individual 
items. 

4.  Prevent  cheating,  if  necessary.   


Further, to prevent undue test anxiety, Gronlund (1976) suggests that
the teacher or test administrator be careful not to:


1.  Threaten  pupils  with  a  test  if  they  are  not  behaving. 

2. Warn pupils to do their best "because the test is
important."

3. Tell students they must work fast to complete the
items on time.

4. Threaten unpleasant activities if they fail.


In sum, these guidelines for administering a test should aid in
ensuring that all the students being tested are given a fair
chance to demonstrate what they know on the domains being tested.




4.9    References  Cited 


Ahmann,  J.  S.,  &  Glock,  M.  D.    Evaluating  pupil  growth.     (5th  ed.) 
Boston:    Allyn  and  Bacon,  1975. 

Clemans, W. V. Test administration. In R. L. Thorndike (Ed.),
Educational measurement. (2nd ed.) Washington: American
Council on Education, 1971.

Eignor, D. R., & Hambleton, R. K. Effects of test length and advancement
score on several criterion-referenced test reliability and validity
indices. Laboratory of Psychometric and Evaluative Research Report
No. 86. Amherst, MA: School of Education, University of
Massachusetts, 1979.

Fhaner,  S.    Item  sampling  and  decision-making  in  achievement  testing. 
British  Journal  of  Mathematical  and  Statistical  Psychology, 
1974,  27,  172-175. 

Gronlund,  N.  E.    Measurement  and  evaluation  in  teaching.  (3rd  ed.) 
New  York:    MacMillan,  1976. 

Gronlund,  N.  E.    Constructing  achievement  tests.     (2nd  ed.)  Englewood 
Cliffs,  N.J.:    Prentice-Hall,  1977. 

Hambleton, R. K., & Eignor, D. R. Adaptive testing applied to hierarchically-
structured objectives-based curricula. In D. J. Weiss (Ed.),
Proceedings of the 1977 Computerized Adaptive Testing Conference.
Minneapolis, MN: University of Minnesota, 1978.

Lord, F. M., & Novick, M. R. Statistical theories of mental test
scores. Reading, Mass.: Addison-Wesley, 1968.

Millman, J. Passing scores and test lengths for domain-referenced
measures. Review of Educational Research, 1973, 43, 205-216.

Millman,  J.    Criterion-referenced  measurement.     In  W.  J.  Popham  (Ed.), 
Evaluation  in  education:    Current  applications.  Berkeley, 
California:    McCutchan  Publishing  Co.,  1974. 

Noll, V. H., & Scannell, D. P. Introduction to educational
measurement. (3rd ed.) Boston: Houghton Mifflin, 1972.

Novick, M. R., & Jackson, P. H. Statistical methods for educational
and psychological research. New York: McGraw-Hill, 1974.

Novick, M. R., & Lewis, C. Prescribing test length for criterion-
referenced measurement. In C. W. Harris, M. C. Alkin, and
W. J. Popham (Eds.), Problems in criterion-referenced
measurement. CSE monograph series in evaluation, No. 3. Los Angeles:
Center for the Study of Evaluation, University of California,
1974.

Payne, D. A. The assessment of learning: Cognitive and affective.
Lexington, MA: D. C. Heath, 1974.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs,
NJ: Prentice-Hall, 1978.

Prescott, G. A. Test service bulletin 102, Test administration guide.
New York: Harcourt Brace Jovanovich, Undated.

Spineti, J., & Hambleton, R. K. A computer simulation study of tailored
testing strategies for objectives-based instructional programs.
Educational and Psychological Measurement, 1977, 37, 139-158.

Thorndike, R. L. Reproducing the test. In R. L. Thorndike (Ed.),
Educational measurement. (2nd ed.) Washington: American
Council on Education, 1971.

Traxler, A. E. Administering and scoring the objective test. In
E. F. Lindquist (Ed.), Educational measurement. Washington:
American Council on Education, 1951.

Wilcox, R. A note on the length and passing score of a mastery test.
Journal of Educational Statistics, 1976, 1, 359-364.




References Cited in the Eignor-Hambleton Paper

Algina, J., & Noe, M. J. A study of the accuracy of Subkoviak's single-
administration estimate of the coefficient of agreement using two
true-score estimates. Journal of Educational Measurement, 1978,
15, 101-110.

Berk, R. A. Determination of optimal cutting scores in criterion-
referenced measurement. Journal of Experimental Education, 1976,
45, 4-9.

Block, J. H. Student learning and the setting of mastery performance
standards. Educational Horizons, 1972, 50, 183-190.

Eignor, D. R. Psychometric and methodological contributions to criterion-
referenced testing technology. Unpublished doctoral dissertation,
University of Massachusetts, 1979.

Fhaner,  S.      Item  sampling  and  decision-making  in  achievement  testing. 
British  Journal  of  Mathematical  and  Statistical  Psychology,  1974, 
27,  172-175. 

Hambleton, R. K., & Eignor, D. R. A practitioner's guide to criterion-
referenced test development, validation, and test score usage.
Laboratory of Psychometric and Evaluative Research Report No. 70.
Amherst, MA: School of Education, University of Massachusetts,
1978.

Hambleton, R. K., & Novick, M. R. Toward an integration of theory and
method for criterion-referenced tests. Journal of Educational
Measurement, 1973, 10, 159-170.

Hambleton, R. K., & Rovinelli, R. A Fortran IV program for generating
examinee response data from logistic test models. Behavioral
Science, 1973, 17, 73-74.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. Criterion-
referenced testing and measurement: A review of technical issues
and developments. Review of Educational Research, 1978, 48, 1-47.

Huynh, H. On the reliability of decisions in domain-referenced testing.
Journal of Educational Measurement, 1976, 13, 253-264.

Livingston, S. A. Assessing the reliability of tests used to make pass/
fail decisions. COPA Research Report. Princeton, NJ: Educational
Testing Service, 1978.

Lord, F. M. A strong true-score theory, with applications. Psychometrika,
1965, 30, 239-270.




Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, MA: Addison-Wesley, 1968.

Marshall, J. L., & Haertel, E. H. The mean split-half coefficient of
agreement: A single administration index of reliability for
mastery tests. Unpublished manuscript, University of Wisconsin,
1976.

Millman, J. Passing scores and test lengths for domain-referenced
measures. Review of Educational Research, 1973, 43, 205-216.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley, CA:
McCutchan Publishing Co., 1974.

Novick,  M.  R.,  &  Jackson,  P.  H.    Statistical  methods  for  educational  and 
psychological  research.    New  York:    McGraw-Hill,  1974. 

Novick, M. R., & Lewis, C. Prescribing test length for criterion-
referenced measurement. In C. W. Harris, M. C. Alkin, and W. J.
Popham (Eds.), Problems in criterion-referenced measurement.
CSE monograph series in evaluation, No. 3. Los Angeles: Center
for the Study of Evaluation, University of California, 1974.

Popham,  W.  J.,  &  Husek,  T.  R.    Implications  of  criterion-referenced 
measurement.    Journal  of  Educational  Measurement,  1969,  6,  1-9. 

Subkoviak,  M.    Estimating  reliability  from  a  single  administration  of  a 
criterion-referenced  test.    Journal  of  Educational  Measurement, 
1976,  13,  265-275. 

Subkoviak, M. J. Empirical investigation of procedures for estimating
reliability for mastery tests. Journal of Educational Measurement,
1978, 15, 111-116. (a)

Subkoviak, M. J. The reliability of mastery classification decisions.
Paper presented at the First Annual Johns Hopkins University
National Symposium on Educational Research, Washington, October
27, 1978. (b)

Swaminathan, H., Hambleton, R. K., & Algina, J. Reliability of criterion-
referenced tests: A decision-theoretic formulation. Journal of
Educational Measurement, 1974, 11, 263-268.

Wilcox, R. A note on the length and passing score of a mastery test.
Journal of Educational Statistics, 1976, 1, 359-364.

Wilcox, R. R. Estimating the likelihood of false-positive and false-
negative decisions in mastery testing: An empirical Bayes approach.
Journal of Educational Statistics, 1977, 2, 289-307.




Unit  5 

Reliability,  Validity  and  Norms 



Prepared  By 


Ronald  K.  Hambleton 
University  of  Massachusetts,  Amherst 

and   

Daniel  R.  Eignor 
Educational  Testing  Service 


March  15,  1979 


Table of Contents

5.0  Overview of the Unit

5.1  Criterion-Referenced Test Score Uses

5.2  Approaches to Reliability Assessment

     5.2.1  Early Work

     5.2.2  Reliability of Domain Score Estimates

     5.2.3  Reliability of Mastery Classification Decisions

     5.2.4  Summary of the Reliability Discussion

5.3  Validity of Criterion-Referenced Tests

     5.3.1  Introduction

     5.3.2  Clarification of Several Validity Issues

     5.3.3  Content Validation Studies

     5.3.4  Construct Validation Studies

            Guttman Scalogram Analysis

            Factor Analysis

            Experimental Studies of Sources of Invalidity

     5.3.5  Summary

5.4  Norms for Interpreting Criterion-Referenced Test Scores

5.5  References




5.0    Overview of the Unit¹

This unit covers step ten of the Criterion-Referenced Test Development
and Validation Model presented in Unit 1.

A good test, whether it is norm-referenced or criterion-referenced,
must result in reliable and valid test scores. The particular form these
two psychometric concepts take (i.e., how they will be estimated) will
depend on the intended use of the criterion-referenced test scores. In
this unit of the materials, we will offer procedures for ascertaining the
reliability and validity of criterion-referenced test scores.

If the procedures discussed in Units 2, 3, and 4 are carefully
followed, a criterion-referenced test score can give detailed information on
what an individual can and can't do with respect to a content domain.
Sometimes this information isn't enough, however; for instance, a decision
maker might also want to know how well a student (or group of students) is
performing relative to the performance of other groups (perhaps last year's
graduating class or a group of students in a neighboring school district).
Norms data can supply the extra information necessary for the decision
maker to determine how well, on a comparative basis, an individual (or
group) is performing. In section 5.4, we will discuss the use of norms
with criterion-referenced tests.


¹ Several sections of the material are from Hambleton, R. K., Swaminathan,
H., Algina, J., & Coulson, D. B. Criterion-referenced testing and
measurement: A review of technical issues and developments. Review of
Educational Research, 1978, 48, 1-47.




5.1    Criterion-Referenced Test Score Uses

Two uses of criterion-referenced test scores are of special interest
in our work. The first involves the estimation of examinee domain scores.
In this application, it is important to minimize the "error" defined as the
difference between an examinee's "estimated domain score" and "domain
score." The second application involves the assignment of examinees to
mastery states (where each state is "keyed" to an instructional decision)
based on their criterion-referenced test score performance. In this second
application, among other things, the test score user must be concerned about
the "errors" arising from inconsistent mastery state assignments across
parallel-form administrations of the test or across a retest administration
of the test.




5.2    Approaches  to  Reliability  Assessment 

5.2.1    Early  Work 

Perhaps the first discussion of the reliability of criterion-referenced
tests was by Popham and Husek (1969). These authors took the point of view
that while internal consistency and temporal stability may be important
characteristics of test scores that result from criterion-referenced
measurement, the coefficients prescribed by classical test theory for
assessing these characteristics may be inappropriate. They noted the
well-known result that test score reliability for a group of examinees
is dependent on test score variability. Since it is not uncommon to
observe rather homogeneous distributions of criterion-referenced test scores,
they feared that test developers would "scrap" their tests because of low
reliability values. Basically, they argued that test developers should
not worry too much if they obtained low classical reliability estimates
(low values were to be expected). They did not, however, suggest any
concrete alternate approaches for estimating reliability of
criterion-referenced tests. In hindsight, they might have suggested that
test developers "create" test score variance by "pooling" test performance
of two groups of examinees: those expected to be "masters" of the material
included in a test (perhaps a group of examinees after instruction) and
those who would be expected to be "non-masters" (perhaps a group of
examinees prior to receiving instruction). It then would be possible to
apply any of the classical reliability approaches and interpret the results
in the usual way (see, for example, Haladyna, 1974). On the other hand,
there would still remain problems in using classical approaches to
reliability with criterion-referenced tests. These problems will be
discussed below.

Hambleton and Novick (1973) also addressed the matter of classical
test theory applications to criterion-referenced tests. They noted:

    Thus, it seems clear that the classical approaches to
    reliability and validity estimation will need to be
    interpreted more cautiously (or discarded) in the analysis of
    criterion-referenced tests. Perhaps an even more serious
    reservation concerning the classical approach to reliability
    and validity estimation for criterion-referenced tests, if
    one looks at these psychometric concepts in decision-theoretic
    terms, is that the correlational method represents an
    inappropriate choice of a loss function (squared-error loss
    in the it metric) with which to evaluate a test (p. 167).

The latter of their two points is important but unfortunately not often
cited by test developers as a reason for seeking out new testing methods
for the design, interpretation, and use of criterion-referenced tests.

One of the first suggestions for an approach to the reliability of
criterion-referenced tests came from Livingston (1972a). He began his
interesting work by assuming that the purpose of a criterion-referenced
test was to discriminate each examinee's estimated domain score from a
cut-off score. It is then possible to redefine variations in estimated
domain scores and domain scores about the cut-off score rather than the
mean domain score as is done in classical test theory. Livingston's
approach to criterion-referenced test reliability estimation takes the form:

$$\frac{\sigma^2(\pi) + (\bar{\pi} - \pi_0)^2}{\sigma^2(\hat{\pi}) + (\bar{\pi} - \pi_0)^2}$$

where $\hat{\pi}$ is an estimated domain score, $\pi$ is an examinee's
domain score, $\bar{\pi}$ is the mean of the domain scores, $\pi_0$ is the
cut-off score, $\sigma^2(\hat{\pi})$ is the variance of estimated domain
scores, and $\sigma^2(\pi)$ is the variance of domain scores. Adding
$(\bar{\pi} - \pi_0)^2$ to each variance expresses the variation about the
cut-off score, $\pi_0$, rather than about the mean. It is easy to see that
Livingston's estimate of reliability exceeds the classical estimate of
reliability given by the expression

$$\sigma^2(\pi) / \sigma^2(\hat{\pi})$$

and increases as $(\bar{\pi} - \pi_0)^2$ increases. In other words, the further
the  group  mean  domain  score  is  from  the  cut-off  score,  the  more  reliable 
the  scores  are  said  to  be.    Notice,  even  though  domain  score  variance  may 
be  zero  (a  result  which  would  lead  to  a  zero  estimate  of  reliability  in 
classical  test  theory),  it  is  still  possible  for  Livingston's  estimate  to 
exceed  zero.     Immediately  following  the  publication  of  Livingston's  work 
there  were  several  published  responses  to  it  and  replies  from  Livingston 
(1972b,  1972c).    Harris  (1972)  made  the  observation  that  the  standard  error 
of  measurement  was  the  same  regardless  of  which  approach  to  reliability  was 
used. This is an important point and is one reason for not rejecting all
concepts from classical test theory with criterion-referenced tests. The
fact is that the standard error of measurement is one method for setting
up confidence bands around domain score estimates (albeit a conservative
method). However, this particular point, in and of itself, does not
detract from Livingston's formulation or the usefulness of his statistic.
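
The relation between Livingston's ratio and the classical ratio can be checked numerically. The short Python sketch below is illustrative only: the function names and the input values are assumptions made for this example and do not come from Livingston's paper. It adds the squared distance between the mean domain score and the cut-off score to both variance terms, as described above.

```python
def classical_reliability(var_domain, var_estimate):
    """Classical ratio: variance of domain scores over variance of estimates."""
    return var_domain / var_estimate


def livingston_coefficient(var_domain, var_estimate, mean_score, cutoff):
    """Livingston-type ratio: the squared distance between the mean domain
    score and the cut-off score is added to numerator and denominator."""
    shift = (mean_score - cutoff) ** 2
    return (var_domain + shift) / (var_estimate + shift)


# Hypothetical values on the proportion-correct scale.
classical = classical_reliability(0.01, 0.02)
livingston = livingston_coefficient(0.01, 0.02, mean_score=0.80, cutoff=0.70)
no_variance = livingston_coefficient(0.0, 0.01, mean_score=0.80, cutoff=0.70)
print(round(classical, 2), round(livingston, 2), round(no_variance, 2))
```

With these hypothetical inputs the classical ratio is .50 and the Livingston-type ratio is about .67. The third value stays above zero even though the domain score variance was set to zero, which is the behavior noted in the text.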

Hambleton and Novick (1973) took issue with Livingston's statement
concerning the purpose of criterion-referenced tests. They argued that
the difference of an examinee's domain score from a cut-off score was not
nearly so important as whether or not an examinee was assigned to the same
side of the cut-off score (mastery state) across parallel-form (or retest)
administrations of a test. Therefore, they predicted Livingston's
approach would have limited usefulness. Of course, this is conjecture on
their part, and only time will tell if they are correct. Results reported by
Hambleton (1974) do support the Hambleton-Novick position, but it is quite
possible that others will agree with Livingston.¹

Shavelson, Block, and Ravitch (1972) took issue with Livingston for
reporting the reliability of test scores obtained by summing across items
keyed to different objectives. Shavelson et al. make an important point
(i.e., that reliability information is needed on each subset of items
measuring an objective included in a test), but it is a point that
Livingston can easily handle in his own formulation (Livingston, 1972c).
Like Harris (1972), Shavelson and his colleagues also point out the
usefulness of the standard error of measurement of a test. They go on to
note that the standard error of measurement is not influenced by
Livingston's approach to reliability estimation.


¹ Recently we had an opportunity to read an excellent manuscript published
in the Journal of Educational Measurement by Brennan and Kane (1977). They
derive a reliability measure (referred to in their work as an index of
dependability) for criterion-referenced tests which is developed within the
context of generalizability theory (Cronbach, Gleser, Nanda, and Rajaratnam,
1972). Like Livingston, they study examinee domain score deviations from a
cut-off score. They point out that there may be occasions when squared-error
loss a la Livingston (1972a) is a more appropriate choice of loss function
than threshold loss adopted by Hambleton and Novick (1973). They note:

    A squared-error loss function has the advantage of being sensitive
    to the magnitude of errors, but the disadvantage of being
    sensitive to all errors of measurement, including those that do
    not lead to misclassification.

    Neither of these loss functions is ideal, and a choice between
    the two must be made on practical grounds. A threshold loss
    function is appropriate when there is a sharp cut-off, and all
    misclassifications are, at least approximately, equal in their
    impact. A squared-error loss function is likely to be more
    appropriate when either of these assumptions is violated.

In a follow-up paper which will be published soon, Brennan and Kane (in
press) carry on their work with randomly parallel tests and concepts from
generalizability theory. The major strength of this new work is that they
are able to study many of the approaches to reliability of norm-referenced
and criterion-referenced tests within a single framework and thereby draw
out some important similarities and differences among the approaches.


5.2.2    Reliability  of  Domain  Score  Estimates 

When there is test score variance it is possible to estimate the
standard error of measurement of a criterion-referenced test. (We will
assume, for convenience, that the test measures only a single objective.)
Whereas reliability estimates for a test vary from one sample of examinees
to another, the standard error of measurement is generally invariant
across samples (Lord and Novick, 1968) and therefore rather useful for
interpreting test scores, whether they be scores from a norm-referenced test
or a criterion-referenced test. When strictly parallel tests are
available, well-known methods for estimating the standard error of
measurement can be used.

In computing the standard error of measurement (SEM), any of the
established procedures for determining the correlation coefficient can be
used. That is, repeated measures, parallel forms, or corrected split-half
procedures may be used. Further, if one wants to use a lower bound estimate
of reliability to compute SEM, then the Kuder-Richardson formula 21 can be
used.

The reader should have two immediate questions: (1) The statistic depends
upon a correlation coefficient, so what will be the effect of a restricted
range of scores? (2) How do you interpret the statistic?

To answer the first question, we must first understand that as an
indicant of error, the smaller the SEM is in value, the better the test is,
i.e., the more reliable the test is. With this in mind, one next must
notice that the formula for the SEM involves not only the reliability
coefficient, but also the standard deviation of test scores. The effect of
these two variables is such that, operating in unison, they allow SEM to be
relatively unaffected by the homogeneity of test scores. For example,
suppose all the test scores were clumped at the upper end of the test score
continuum. Then r would be low, but $\sqrt{1-r}$ would be a large number.
Likewise SD would be low in value, and we can look at their product as being
a moderately sized number. If, on the other hand, scores are spread across
the continuum, then r is likely to be large in value, as would be SD, but
then $\sqrt{1-r}$ would be low, and the result would again be a moderately
sized number. The point to be made, using this very simplistic example, is
that whereas the reliability coefficient is affected by homogeneity of
scores, the standard error of measurement, due to the nature of the formula,
is relatively unaffected by the spread of scores in the group of examinees
tested. And that is how it should be, for the error (E = X - T) inherent in
an examinee's test score should not depend on the shape of the test score
distribution for a group of examinees.

How is the statistic used? What is done, while not technically correct,
is as follows: The SEM, along with an examinee's test score, is used to
set up a probability statement about the location of the examinee's (unknown)
domain score. An underlying normal distribution is assumed, and hence the
area under the normal curve can be used to make some "reasonable" statement
about an examinee's domain score. For instance, suppose the SEM for a test
was 5 and a person obtained a score of 50. Based upon the normal curve,
68% of the area lies within one standard deviation to the right and left
of the mean. Applied to this example, this could be interpreted, in the
non-technical fashion we are using, that the chances are 2 out of 3 that
the individual's domain score lies between one standard deviation above the
mean (50 + 5) and one standard deviation below (50 - 5). Here the test
score is used as the mean, and SEM as the standard deviation. It must be
pointed out that on a strictly theoretical level, the above interpretation is
wrong on two counts. One, a probability of 2/3 can't really be attached;
the score is either between 45 and 55 (a probability of one) or not (a
probability of zero). Two, as mentioned above, the SEM should be applied
to the domain score. So why do it? The answer lies in that, from a
practical point of view, we can be reasonably sure about the statement
we are making, and this is, after all, better than no statement at all.


Example

Suppose an examinee answered 15 out of 20 items correctly on a
criterion-referenced test. Suppose also that the test score
reliability is .80 and the standard deviation of domain scores
is .15.

Questions:

1.  What is the value of the standard error of measurement?

2.  What is the examinee's domain score estimate?

3.  What are the lower and upper limits for an approximately 95%
    confidence band for the examinee's domain score?

Answers:

1.  SEM  =  SD $\sqrt{1 - r}$
         =  .15 x .45
         =  .07

2.  An unbiased domain score estimate for the examinee is .75
    (15 divided by 20).

3.  Upper limit  =  .75 + 2 x .07
                 =  .89

    Lower limit  =  .75 - 2 x .07
                 =  .61

Therefore it can be said that there is an approximately 95%
probability that an examinee with a domain score estimate of .75
has a "domain score" somewhere on the interval [.61, .89].
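
The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function names are hypothetical, and the multiplier of 2 is the same rough stand-in for a 95% band that the example uses.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_band(estimate, sem, multiplier=2.0):
    """Return (lower, upper) limits: estimate -/+ multiplier * SEM."""
    half_width = multiplier * sem
    return estimate - half_width, estimate + half_width


# Values from the worked example: SD = .15, r = .80, 15 of 20 items correct.
sem = standard_error_of_measurement(0.15, 0.80)
domain_score_estimate = 15 / 20
lower, upper = confidence_band(domain_score_estimate, sem)
print(round(sem, 2), round(lower, 2), round(upper, 2))
```

Carried at full precision the limits come out at about .62 and .88. The example obtains .61 and .89 because it rounds the SEM to .07 before forming the band.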



It is often the case that parallel forms of a criterion-referenced
test are constructed by randomly sampling items from a "pool" of test
items keyed to an objective. Such tests are referred to as randomly
or nominally parallel tests, and typically do not meet the
requirements for strictly parallel tests. Randomly parallel tests are
examples of the type of measurements for which generalizability theory
(Cronbach et al., 1972) is intended. It is appropriate at this point to
turn to generalizability theory to obtain definitions of errors of
measurement, of error variance, and formulae for estimating error variance.
(See Brennan and Kane [1977, in press] for a more fully developed discussion
of the topic.)

Cronbach et al. (1972, pp. 25-26) defined three different errors
of measurement. One error, $\Delta$, is appropriate when the
proportion-correct score is taken as an estimate of the domain score. The
error $\varepsilon$ is appropriate when a linear regression estimate of the
domain score is made, and the third error, $\delta$, is appropriate when an
estimate of the deviation between the ith examinee's domain score and the
mean domain score is made. The second error will not be discussed because
typically it is impossible to obtain a regression estimate of the domain
score on the basis of a single randomly parallel test (see Cronbach et al.,
1972, pp. 140-146). The third error will not be discussed here because
typically there is no reason to estimate the deviation score with
criterion-referenced tests.




The error $\Delta_i$ is defined as the difference between the observed
proportion-correct score and the domain score for the ith examinee.
Suppose a domain of items exists. Let $x_{ij}$ be the score (0 or 1) for the
ith examinee on the jth item. Define $\Delta_{ij} = x_{ij} - \pi_i$.
For an n item test, the error of measurement $\Delta_i$ is
$n^{-1} \sum_j \Delta_{ij}$.

Cronbach et al. (1972) discussed three variances for $\Delta_i$. These are
$\sigma^2_{\Delta_i} = n^{-1} E_j \Delta_{ij}^2$, the error variance for
examinee i on an n item test constructed by random sampling of items;
$\sigma^2_{\Delta} = E_i \sigma^2_{\Delta_i}$, the average over examinees
of $\sigma^2_{\Delta_i}$; and
$n^{-2} E_i (\sum_j \Delta_{ij} - E_i \sum_j \Delta_{ij})^2$, the variance
over examinees of $\Delta_i$ for a given test.

To evaluate the accuracy of a particular test, it would be
appropriate to estimate the third variance mentioned above. However,
estimation of the quantity requires the administration of several randomly
parallel forms, which may not be feasible. Moreover, Cronbach et al. (1972)
were pessimistic about the utility of designs for estimating parameters that
characterize a given test and so the designs will not be reviewed here.
The interested reader is referred to Cronbach et al. (1972, pp. 101-102),
and references therein for details.

If an n item test has been administered, an estimate of
$\sigma^2_{\Delta}$ can be obtained by laying out the item data as a one-way
ANOVA with examinees as the factor. Item scores are considered to be
replications within a level of the examinee factor. The estimate is given by

$$\hat{\sigma}^2_{\Delta} = \frac{1}{n} MS_{wp},$$

where $MS_{wp}$ is the within persons or replications mean square. If
several randomly parallel forms of n items each are available then
$\sigma^2_{\Delta}$ can be estimated using the same formula. The
proportion-correct scores on the various forms are the replications within a
level of the examinee factor.

In principle, it is possible to estimate $\sigma^2_{\Delta_i}$ using the
formula

$$\hat{\sigma}^2_{\Delta_i} = \frac{N-n}{N} \cdot \frac{1}{n(n-1)} \sum_{j=1}^{n} (x_{ij} - \hat{\pi}_i)^2$$

for each examinee, where $\hat{\pi}_i$ is the observed proportion-correct
score (estimated domain score) for the ith examinee. The factor (N-n)/N is
used when the domain is finite; N is the number of items in the domain. When
n is small relative to N, the estimate $\hat{\sigma}^2_{\Delta_i}$ may be
quite variable over random samples of items.
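
The one-way ANOVA layout and the per-examinee estimate can be sketched in Python as follows. The function names and the small 0/1 data set are hypothetical, and the finite-domain correction is taken here to be (N-n)/N, which is an assumption about the intended factor. This is a sketch of the computations described above, not code from Cronbach et al.

```python
def within_person_mean_square(scores):
    """MS_wp for a one-way layout: examinees are the factor levels and the
    0/1 item scores are the replications within each level."""
    n_examinees = len(scores)
    n_items = len(scores[0])
    ss_within = 0.0
    for row in scores:
        mean = sum(row) / n_items
        ss_within += sum((x - mean) ** 2 for x in row)
    return ss_within / (n_examinees * (n_items - 1))


def group_error_variance(scores):
    """Group-level estimate: MS_wp divided by the number of items n."""
    return within_person_mean_square(scores) / len(scores[0])


def examinee_error_variance(row, domain_size=None):
    """Per-examinee estimate; the assumed factor (N - n) / N is applied
    only when a finite domain size N is supplied."""
    n = len(row)
    p_hat = sum(row) / n
    variance = sum((x - p_hat) ** 2 for x in row) / (n * (n - 1))
    if domain_size is not None:
        variance *= (domain_size - n) / domain_size
    return variance


# Hypothetical item scores for three examinees on a 4-item test.
scores = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0]]
print(group_error_variance(scores))
print([examinee_error_variance(row) for row in scores])
```

Without the finite-domain factor, the group estimate equals the average of the per-examinee estimates, which matches the definition of the group variance as an average over examinees.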

Another approach for determining the accuracy of domain score
estimates was reported by Millman (1974) and Hambleton, Swaminathan,
and Algina (1976). They suggested that the standard error of estimation
derived from the binomial test model, given by the expression
$\sqrt{\hat{\pi}(1-\hat{\pi})/n}$, could be used to set up confidence bands
around domain score estimates. This is a biased estimate and an unbiased
estimate is obtained by substituting (n-1) for n in the expression. This is
an expression for the standard deviation of errors of measurement for an
examinee with domain score $\pi$ across administrations of n item samples
drawn at random from an item pool. A correction can be introduced under the
radical sign when the pool of test items is finite. Advantages of this
approach are that the estimate of error is a function of domain score, less
conservative estimates of error than the one provided by the standard error
of measurement are obtained, and the effect of test length on the precision
of estimates can be studied easily. In addition, the estimate is relatively
easy to compute.
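
A minimal Python sketch of the binomial standard error follows. The function name and the `unbiased` flag are hypothetical choices made for this illustration, and the finite-pool correction is left out.

```python
import math


def binomial_standard_error(p_hat, n_items, unbiased=True):
    """Standard error of a proportion-correct score under the binomial model.
    With unbiased=True the divisor is (n - 1) instead of n."""
    divisor = n_items - 1 if unbiased else n_items
    return math.sqrt(p_hat * (1.0 - p_hat) / divisor)


# Effect of test length on precision for a domain score estimate of .75.
for n_items in (10, 20, 40):
    print(n_items, round(binomial_standard_error(0.75, n_items), 3))
```

For a 20-item test and an estimate of .75 the value is roughly .10, and it shrinks as items are added. This is the sense in which the effect of test length on precision can be studied easily.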



5.2.3    Reliability  of  Mastery  Classification  Decisions 

Carver (1970) proposed two procedures for assessing the reliability
of criterion-referenced tests. The first procedure requires the
administration of the same test to two comparable groups, and a comparison of
the percentages of examinees that were classified as masters. The second
procedure requires the administration of two parallel tests to the same
group, and a comparison of the percentage of "masters" on the two tests.
With either procedure, the more comparable the percentages, the more
reliable the tests are said to be.

Carver (1970) rejected a correlational approach to reliability,
arguing that reliability depends on replicability, but replicability does
not depend on variance. Carver's procedures were based on the replicability
of distributions, while the usual concept of reliability in mental testing
is based on the replicability of individual scores. If satisfied, his
proposed criteria would provide only the weakest form of evidence for
criterion-referenced test reliability; that is, his conditions are necessary
but not sufficient to establish test reliability.

Hambleton and Novick (1973) suggested that the reliability of mastery
classification decisions should be defined in terms of the consistency of
decisions from two administrations of the same test or parallel forms of a
test. Suppose examinees are to be classified into m mastery states. The
index of reliability tentatively suggested by Hambleton and Novick (1973) was

$$p_o = \sum_{k=1}^{m} p_{kk}$$

where $p_{kk}$ is the proportion of examinees classified in the kth mastery
state on both administrations. The index $p_o$ then is the observed
proportion of decisions that are in agreement. The $p_o$ statistic has
considerable intuitive appeal and is certainly easy to calculate but it
suffers from at least one limitation.

Swaminathan, Hambleton, and Algina (1974) argued that $p_o$ does
not take into account the proportion of agreement that occurs by chance
alone and therefore it could give a false impression to users of the
extent of mastery classification consistency. They suggested using
coefficient $\kappa$ (Cohen, 1960) as an index of reliability. This
coefficient is defined as

$$\kappa = (p_o - p_c) / (1 - p_c)$$

where

$$p_c = \sum_{k=1}^{m} p_{k\cdot}\, p_{\cdot k}$$

The symbols $p_{k\cdot}$ and $p_{\cdot k}$ represent the proportions of
examinees assigned to mastery state k on the first and second
administrations, respectively. The symbol $p_c$ represents the proportion of
agreement that would occur even if the classifications based on the two
administrations were statistically independent. Thus, in a sense, it can be
argued that $\kappa$ takes into account the composition of the group and, in
this sense, is more group independent than the simple proportion of
agreement statistic, $p_o$.

The properties of $\kappa$ have been discussed in detail by Cohen (1960,
1968) and Fleiss, Cohen, and Everitt (1969) as well as others. For present
purposes it is sufficient to note that the upper limit is +1 and can occur
only when the marginal proportions for different administrations are equal.
The lower limit is close to -1. The precise lower limit of $\kappa$ is
unimportant in the context of criterion-referenced testing, since any
negative value indicates inconsistency and, therefore, unreliable decisions.

The coefficient $\kappa$ is dependent on all factors that affect the
decision-making procedure: the cut-off score, the heterogeneity of the
group of examinees, and the method of assigning examinees to mastery
states. Millman (personal communication) has suggested that all
of these factors be summarized when reporting $\kappa$ since this
information would contribute to its interpretation.


Example

Suppose parallel forms (denoted Test 1 and Test 2) of a 4-item test are
administered to a group of six students. Suppose further that the cut-off
score is set equal to 75%. Consider the data below (number of items
answered correctly on each test):

    Person    Test 1 Score    Test 2 Score

      A            1               2
      B            4               3
      C            2               3
      D            0               1
      E            3               3
      F            2               3

Question:

What is the value of $\kappa$?



Answer:

We require the proportions of examinees who passed and failed on each
occasion. The information is reported in the chart below:

                                    Test 2
                                                      Marginal
    Test 1            Master      Non-master         Proportion

    Master             .33           0                  .33
    Non-master         .33          .33                 .67

    Marginal
    Proportion         .67          .33

$$p_o = \sum_{k=1}^{2} p_{kk} = p_{11} + p_{22} = .33 + .33 = .66$$

$$p_c = \sum_{k=1}^{2} p_{k\cdot}\, p_{\cdot k} = p_{1\cdot}\, p_{\cdot 1} + p_{2\cdot}\, p_{\cdot 2} = (.33)(.67) + (.67)(.33) = .44$$

$$\kappa = \frac{p_o - p_c}{1 - p_c} = \frac{.66 - .44}{1 - .44} = \frac{.22}{.56} = .39$$

Of course, in practice one would not estimate $\kappa$ based on data from
only six examinees. The example is offered here for illustrative purposes
only.
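
The same calculation can be written as a short Python sketch. The function names are hypothetical, and the two score lists are the six examinees' total scores on each test, read so as to be consistent with the chart above (two masters on Test 1, four on Test 2, cut-off of 75% on a 4-item test).

```python
def mastery_states(scores, n_items, cutoff):
    """True for examinees whose proportion correct meets the cut-off."""
    return [score / n_items >= cutoff for score in scores]


def agreement_and_kappa(states_1, states_2):
    """Return (p_o, p_c, kappa) for two sets of mastery classifications."""
    n = len(states_1)
    categories = set(states_1) | set(states_2)
    p_o = sum(a == b for a, b in zip(states_1, states_2)) / n
    p_c = sum(
        (states_1.count(k) / n) * (states_2.count(k) / n) for k in categories
    )
    kappa = (p_o - p_c) / (1 - p_c)
    return p_o, p_c, kappa


test_1 = [1, 4, 2, 0, 3, 2]  # number correct, persons A-F
test_2 = [2, 3, 3, 1, 3, 3]
p_o, p_c, kappa = agreement_and_kappa(
    mastery_states(test_1, 4, 0.75), mastery_states(test_2, 4, 0.75)
)
print(round(p_o, 2), round(p_c, 2), round(kappa, 2))
```

Working with unrounded proportions gives a coefficient of .40. The hand calculation arrives at .39 because the cell proportions are rounded to two decimals first.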


In criterion-referenced testing situations, it is often the case that
administering parallel forms of a test to get an estimate of $\kappa$ is not
feasible. Possible reasons include: (1) the testing is built into an
objectives-based program and the extra testing would take away
instructional time, and (2) testing occurs quite often, and two test
administrations for each criterion-referenced test would cause the testing
process to dominate the students' learning time. Therefore, what is needed
is a method of arriving at either $\kappa$, or another suitable index, based
upon one administration of a test.

The coefficients $\kappa$ and $p_o$ are defined in terms of repeated
testings, but it would be very useful to have a procedure for estimating
$\kappa$ or $p_o$ on the basis of a single testing. Such a procedure has
been provided by Huynh (1976a), who prefers $\kappa$ to $p_o$. Huynh (1976a,
1978) assumed that $f(x \mid \pi)$ is binomial. He also assumed that the
marginal distribution of the domain scores is a two-parameter beta
distribution. From these assumptions it follows that the marginal
distribution of test scores obtained by administering any random sample of n
items is a negative hypergeometric distribution. Further, the joint
distribution of scores obtained by administering two randomly parallel n
item tests is a bivariate negative hypergeometric distribution. We will not
review his mathematical development here. It is clearly reported in his
paper. It is sufficient to say that his solution is workable, although the
computations involved in obtaining $\kappa$ can be tedious when there are a
moderate number of possible test scores above the cut-off score. Huynh
(1976a) also provided an approximate procedure for estimating $\kappa$ which
appears to work fairly well if the number of test items is not too small.
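
The two stated assumptions (binomial scores given the domain score, beta-distributed domain scores) are enough to compute the agreement indices directly by summing the joint distribution over score pairs. The Python sketch below does only that. It is not Huynh's published procedure: the function names are hypothetical, the beta parameters `alpha` and `beta` are taken as given (how they are fitted from a single administration is not shown), and the cut-off is expressed as a number of items correct between 1 and n.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def joint_probability(x, y, n_items, alpha, beta):
    """P(X = x, Y = y) for two randomly parallel n-item tests when scores are
    binomial given the domain score and domain scores are Beta(alpha, beta)."""
    log_p = (
        math.log(math.comb(n_items, x))
        + math.log(math.comb(n_items, y))
        + log_beta(alpha + x + y, beta + 2 * n_items - x - y)
        - log_beta(alpha, beta)
    )
    return math.exp(log_p)


def consistency_indices(n_items, cutoff, alpha, beta):
    """Return (p_o, p_c, kappa); cutoff is the minimum number correct
    required for a mastery classification."""
    scores = range(n_items + 1)
    p_o = sum(
        joint_probability(x, y, n_items, alpha, beta)
        for x in scores
        for y in scores
        if (x >= cutoff) == (y >= cutoff)
    )
    p_master = sum(
        joint_probability(x, y, n_items, alpha, beta)
        for x in scores
        for y in scores
        if x >= cutoff
    )
    p_c = p_master ** 2 + (1 - p_master) ** 2
    kappa = (p_o - p_c) / (1 - p_c)
    return p_o, p_c, kappa


# Hypothetical inputs: 10 items, mastery at 8 or more correct, Beta(6, 2).
print(consistency_indices(10, 8, 6.0, 2.0))
```

The double sum over score pairs is what makes the exact computation tedious by hand as the number of possible scores grows.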



Alternative procedures for estimating reliability from a single
administration have been provided by Subkoviak (1976), who prefers to work
with $p_o$. While Huynh's approach is more mathematically tractable, it may
be far less useful when the number of examinees is small, a fairly
common occurrence in objectives-based instructional programs.

Subkoviak (1976) defined a coefficient of agreement for individual
i, denoted $p^{(i)}$, as the probability of consistent mastery
classification of examinee i on parallel forms, denoted X and Y. For the
case of two mastery states, this probability is given by

$$p^{(i)} = \mathrm{Prob}(X_i \ge c,\ Y_i \ge c) + \mathrm{Prob}(X_i < c,\ Y_i < c) \qquad (1)$$

where c is the cut-off score, and $X_i$ and $Y_i$ are scores for examinee i
on the two tests. The two terms in Equation (1) represent the probability of
examinee i being assigned to a mastery state or a non-mastery state on both
test administrations, respectively. The coefficient of agreement for a group
of N examinees is given by

$$p_o = \sum_{i=1}^{N} p^{(i)} / N$$

In order to estimate $p^{(i)}$, Subkoviak assumed that for each examinee,
scores on the two forms of the criterion-referenced test were independently
and identically distributed. Also, he assumed $X_i$ and $Y_i$ for a fixed
examinee were identically binomially distributed. This is a questionable
assumption, although test item responses are usually scored 0 or 1, and
item responses are independent. However, the assumption implies that the
items making up the test are equally difficult and this will seldom be
the case. (Fortunately, Subkoviak addressed this point in his paper and
offered a substitute expression, the compound binomial model, to handle
the more typical case.) With only the two assumptions above, Subkoviak
was able to show

$$P(X_i \ge c) = \sum_{x_i = c}^{n} \binom{n}{x_i} \pi_i^{x_i} (1 - \pi_i)^{n - x_i} \qquad (2)$$

and

$$p^{(i)} = [P(X_i \ge c)]^2 + [1 - P(X_i \ge c)]^2 \qquad (3)$$

Once an estimate of an examinee's domain score ($\pi_i$), denoted $\hat{\pi}_i$, is
obtained, $P(X_i \ge c)$ can be determined by substituting $\hat{\pi}_i$ for $\pi_i$ in Equation
(2). $p_c^{(i)}$ is then obtained by substituting the result from Equation (2) into
Equation (3). Any of the methods discussed in Unit 8 could
be used to estimate an examinee's domain score. Subkoviak suggested in
his paper using a regression estimate of $\pi_i$, but the merits of this
approach would depend on the sample estimates of group mean performance
and reliability (as he correctly noted). He also offered several other
possible domain score estimates, several of which will be discussed
in Unit 8, and others which have been reported by Lord and Novick
(1968). A group estimate of the expected proportion of agreement in
mastery classifications across parallel-form administrations can be
obtained by averaging the values of $p_c^{(i)}$ for i = 1, 2, ..., N, where N is the
number of examinees in the group.
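
As a rough illustration of Equations (2) and (3), the short Python sketch below computes the binomial probability of reaching the cut-off score and the resulting coefficient of agreement for one examinee. The function names and the argument `pi_hat` (the domain score estimate) are hypothetical labels chosen for this illustration; the sketch assumes the simple binomial model, not the compound binomial model.

```python
from math import comb


def prob_at_least(cutoff, n_items, pi_hat):
    """Equation (2): P(X >= cutoff) when X is binomial(n_items, pi_hat)."""
    return sum(
        comb(n_items, x) * pi_hat**x * (1 - pi_hat) ** (n_items - x)
        for x in range(cutoff, n_items + 1)
    )


def individual_agreement(cutoff, n_items, pi_hat):
    """Equation (3): chance of the same mastery decision on two parallel forms."""
    p_master = prob_at_least(cutoff, n_items, pi_hat)
    return p_master**2 + (1 - p_master) ** 2


# Illustrative values only: a 4-item test, cut-off of 3, domain score estimate .5
print(prob_at_least(3, 4, 0.5))
print(individual_agreement(3, 4, 0.5))
```

Averaging `individual_agreement` over all examinees in a group would give the group coefficient described above.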

It is also possible to obtain an estimate of κ using Subkoviak's
method. The only additional information needed is the proportion of
examinees assigned to each mastery state on the single test administration.
By making the reasonable assumption that those proportions would be the
same on a retest or a parallel-form administration, the proportion of
agreement expected by chance ($p_c$) can be obtained by the method introduced
earlier ($p_c = p_{1.}p_{.1} + p_{2.}p_{.2}$). For example, using the first set of
test scores from the previous example, it is seen that two of the six
examinees would have been assigned to a mastery state based on their
test scores. Therefore, $p_{2.} = .33$ and $p_{1.} = .67$, and $p_c = .56$ ($p_c = .33^2 + .67^2$).
With a value of $p_c$ and with the $p_0$ estimate from Subkoviak's method,
kappa can quickly be calculated, if desired.
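
The kappa calculation described above can be sketched in a few lines of Python. This is a minimal sketch assuming the usual definition of kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement), and assuming, as the text does, that the proportion of masters is the same on both administrations. The function names are hypothetical.

```python
def chance_agreement(prop_masters):
    """Chance agreement when the mastery proportion is equal on both forms."""
    prop_nonmasters = 1 - prop_masters
    return prop_masters**2 + prop_nonmasters**2


def kappa(agreement, chance):
    """Agreement corrected for chance; undefined when chance agreement is 1."""
    return (agreement - chance) / (1 - chance)


# Two of six examinees classified as masters, as in the example in the text.
p_chance = chance_agreement(2 / 6)
print(round(p_chance, 2))
# 0.66 is used here only as an illustrative agreement value.
print(round(kappa(0.66, p_chance), 2))
```

Because kappa removes the agreement expected by chance, it will be noticeably smaller than the raw coefficient of agreement whenever chance agreement is large.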


Subkoviak's approach to estimating the consistency of mastery
classifications across parallel-form administrations can provide either
individual or group information, and can be estimated from a single
administration of a test. The only two minor problems are that the
probability estimates are inflated due to the inclusion of chance agreement,
and that it is unreasonable to assume all items in a criterion-referenced test
are equally difficult. However, on this latter point, Subkoviak has also
offered a slightly different model (compound binomial) which is capable
of handling the situation.

Subkoviak's method makes it possible to compute the coefficient of
agreement in mastery states across occasions for an individual, and also
the coefficient of agreement for a group of N persons. Since the formulas
developed by Subkoviak are somewhat complex, a step-by-step procedure will
be specified and an example will be offered.

The  steps  in  the  method  are  as  follows: 

1.  Obtain an estimate of the proportion of items in the whole domain
    of items an examinee can answer correctly. A convenient estimate
    is obtained by setting

    $$\hat{p}_i = \frac{x_i}{n},$$

    where $\hat{p}_i$ = proportion-correct score for examinee i,
          $x_i$ = his/her test score,
          n = total number of items included in the test (measuring the
              objective of interest).

2.  Determine the probability that the examinee's score is greater
    than or equal to the cutting score (c) using the form of the
    underlying (binomial) distribution. The probability is given by:

    $$P(x_i \ge c) = \sum_{x_i=c}^{n} \binom{n}{x_i} \hat{p}_i^{\,x_i} (1-\hat{p}_i)^{n-x_i},$$

    where $\hat{p}_i$, $x_i$, c and n are defined as before, and

    $$\binom{n}{x_i} = \frac{n!}{x_i!\,(n-x_i)!},$$

    where n! = n(n-1)(n-2) ... (1)
    [for example: 4! = 4(3)(2)(1)].

3.  Using the result from step (2), compute the coefficient of
    agreement for person i using the following formula:

    $$p_c^{(i)} = [P(x_i \ge c)]^2 + [1 - P(x_i \ge c)]^2.$$

4.  Finally, compute the coefficient of agreement ($p_c$) for a group
    of N persons, using the following formula:

    $$p_c = \frac{\sum_{i=1}^{N} p_c^{(i)}}{N}.$$


The final result, $p_c$, provides an estimate of the coefficient of
agreement for the group had two test administrations taken place. The
subscript "c" is included to clarify that the coefficient is dependent upon the
assigned cut-off score. If $p_c$ is high (i.e., close to one), we can be fairly
confident that there would be a high degree of consistency of placement into
mastery states over the two occasions.

If the number of test items is small, usually a better estimate
(than $\hat{p}_i$) of an examinee's domain score can be obtained by using a regression
estimate of domain score [$r\,\hat{p}_i + (1-r)\,\bar{p}$, where r = test reliability
and $\bar{p}$ = average proportion-correct score for the examinees], or a Bayesian
estimate. (Several promising Bayesian estimates are introduced in Unit 8.)
The improved estimate can be substituted for $\hat{p}_i$ in step 2. A quick
way to compute r, the test reliability, is to use Kuder-Richardson formula
21 (KR21), given by:

$$\mathrm{KR21} = \frac{n}{n-1}\left[1 - \frac{\bar{x}(n-\bar{x})}{n s_x^2}\right],$$

where $\bar{x}$ is the mean test score and $s_x^2$ is the variance of the test scores.
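
The KR21 reliability and the regression estimates of domain scores can be computed together, as in the Python sketch below. The function names are hypothetical. The variance is computed with N (not N - 1) in the denominator, an assumption made here to match the worked example in this section.

```python
def kr21(scores, n_items):
    """Kuder-Richardson formula 21 from raw test scores."""
    n_examinees = len(scores)
    mean = sum(scores) / n_examinees
    # Variance with N in the denominator; KR21 is undefined if this is zero.
    variance = sum((x - mean) ** 2 for x in scores) / n_examinees
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )


def regression_estimates(scores, n_items, reliability):
    """Pull each proportion-correct score toward the group mean proportion."""
    group_mean = sum(scores) / (len(scores) * n_items)
    return [
        reliability * (x / n_items) + (1 - reliability) * group_mean
        for x in scores
    ]


# Test 1 scores for the six examinees in the example (4 items).
scores = [1, 4, 2, 0, 3, 2]
r = kr21(scores, n_items=4)
print(round(r, 2))
print([round(p, 2) for p in regression_estimates(scores, 4, r)])
```

The lower the reliability, the more each estimate is pulled toward the group mean, which is why the approach is most useful for short tests.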


Subkoviak (1976) uses this particular approach. An example is
offered next.

Example:
Given the data for test 1 in the previous example, compute $p_c$. Use
regression estimates of domain scores, and KR21 as an estimate of test
score reliability.

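Test 1

                  Item
Person     1     2     3     4     Score
  A        1     0     0     0       1
  B        1     1     1     1       4
  C        1     0     1     0       2
  D        0     0     0     0       0
  E        0     1     1     1       3
  F        0     1     1     0       2

Reliability Estimate

(a)  $\bar{x} = \frac{\sum x_i}{N} = \frac{12}{6} = 2.0$

(b)  $s_x^2 = \frac{\sum (x_i - \bar{x})^2}{N} = \frac{(1-2)^2+(4-2)^2+(2-2)^2+(0-2)^2+(3-2)^2+(2-2)^2}{6} = \frac{1+4+0+4+1+0}{6} = 1.67$

(c)  $\mathrm{KR21} = \frac{n}{n-1}\left[1 - \frac{\bar{x}(n-\bar{x})}{n s_x^2}\right] = \frac{4}{3}\left[1 - \frac{2(4-2)}{4(1.67)}\right] = \frac{4}{3}[1 - .599] = .53$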

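Regression Estimates of Domain Scores

(d)  $\hat{p}_i = r\left(\frac{x_i}{n}\right) + (1-r)\left(\frac{\bar{x}}{n}\right)$

     $\hat{p}_1 = .53\left(\tfrac{1}{4}\right) + (1 - .53)\left(\tfrac{1}{2}\right) = .37$

     $\hat{p}_2 = .53\left(\tfrac{4}{4}\right) + (1 - .53)\left(\tfrac{1}{2}\right) = .77$

     And, in a like fashion, $\hat{p}_3 = .50$, $\hat{p}_4 = .24$, $\hat{p}_5 = .63$, $\hat{p}_6 = .50$.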

(e)  $P(x_i \ge c) = \sum_{x_i=c}^{n} \binom{n}{x_i} \hat{p}_i^{\,x_i} (1-\hat{p}_i)^{n-x_i}$

     For individual 1, with the cut-off score c = 3,

     $$P(x_1 \ge 3) = \sum_{x_1=3}^{4} \binom{4}{x_1} (.37)^{x_1} (1 - .37)^{4-x_1} = \binom{4}{3}(.37)^3(1 - .37)^1 + \binom{4}{4}(.37)^4(1 - .37)^0,$$

     and $\binom{4}{3} = \frac{4!}{3!\,1!} = 4$, so that

     $$P(x_1 \ge 3) = .1276 + .0187 = .1463.$$

     For individual 2, $P(x_2 \ge 3) = \sum_{x_2=3}^{4} \binom{4}{x_2} (.77)^{x_2} (.23)^{4-x_2} = .7630$
                    3, $P(x_3 \ge 3) = \sum_{x_3=3}^{4} \binom{4}{x_3} (.5)^{x_3} (.5)^{4-x_3} = .3125$
                    4, $P(x_4 \ge 3) = \sum_{x_4=3}^{4} \binom{4}{x_4} (.24)^{x_4} (.76)^{4-x_4} = .0453$
                    5, $P(x_5 \ge 3) = \sum_{x_5=3}^{4} \binom{4}{x_5} (.63)^{x_5} (.37)^{4-x_5} = .5275$
                    6, $P(x_6 \ge 3) = \sum_{x_6=3}^{4} \binom{4}{x_6} (.5)^{x_6} (.5)^{4-x_6} = .3125$

     all computed in a like fashion.

(f)  $p_c^{(i)} = [P(x_i \ge c)]^2 + [1 - P(x_i \ge c)]^2$

     For individual 1,

     $$p_c^{(1)} = [P(x_1 \ge 3)]^2 + [1 - P(x_1 \ge 3)]^2 = (.1463)^2 + (1 - .1463)^2 = .0214 + .7288 = .7502$$

     In a like fashion,

     $p_c^{(2)} = (.763)^2 + (1 - .763)^2 = .6383$
     $p_c^{(3)} = (.3125)^2 + (1 - .3125)^2 = .5704$
     $p_c^{(4)} = (.0453)^2 + (1 - .0453)^2 = .9134$
     $p_c^{(5)} = (.5275)^2 + (1 - .5275)^2 = .5016$
     $p_c^{(6)} = (.3125)^2 + (1 - .3125)^2 = .5704$

(g)  Finally, $p_c = \frac{\sum_{i=1}^{N} p_c^{(i)}}{N}$ becomes

     $$p_c = \frac{\sum_{i=1}^{6} p_c^{(i)}}{6} = \frac{.7502 + .6383 + .5704 + .9134 + .5016 + .5704}{6} = .66.$$

The coefficient of agreement index of .66 obtained using Subkoviak's method
should be fairly close in value to the observed proportion of agreement, $p_0$,
which is used in computing κ. That value was .66, and as the reader can see,
the values coincide.
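
As a check on the hand calculation, the whole example can be run end to end in a few lines of Python. This is a sketch under the same assumptions as the example (binomial model, KR21 reliability, regression estimates of domain scores); the function name is hypothetical, and because intermediate values are not rounded, individual figures may differ slightly in the third or fourth decimal place from those shown in steps (a) to (g).

```python
from math import comb


def subkoviak_group_agreement(scores, n_items, cutoff):
    """Group coefficient of agreement from a single test administration."""
    n_examinees = len(scores)
    mean = sum(scores) / n_examinees
    variance = sum((x - mean) ** 2 for x in scores) / n_examinees
    # Steps (a)-(c): KR21 reliability estimate.
    r = (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )
    total = 0.0
    for x in scores:
        # Step (d): regression estimate of the domain score.
        p_hat = r * (x / n_items) + (1 - r) * (mean / n_items)
        # Step (e): binomial probability of reaching the cut-off score.
        p_master = sum(
            comb(n_items, k) * p_hat**k * (1 - p_hat) ** (n_items - k)
            for k in range(cutoff, n_items + 1)
        )
        # Step (f): individual coefficient of agreement.
        total += p_master**2 + (1 - p_master) ** 2
    # Step (g): average over examinees.
    return total / n_examinees


test_1_scores = [1, 4, 2, 0, 3, 2]
print(round(subkoviak_group_agreement(test_1_scores, n_items=4, cutoff=3), 2))
```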



5.2.4    Summary  of  the  Reliability  Discussion 

The definition of criterion-referenced reliability chosen should
depend upon how the test scores are used. Once a decision has been made
about test usage, there are still a number of ways of assessing reliability,
depending either upon the underlying distributional assumptions you choose
to make or the number of test administrations possible. The chart in
Figure 5.2.4 will be helpful in summarizing the material in section 5.2.

Perhaps it should be stressed at this point that reliability
information (whether one is discussing domain scores or mastery classification
decisions) needs to be reported on an objective-by-objective basis
(Hambleton and Novick, 1973; Swaminathan et al., 1975). If a criterion-
referenced test measures more than a single objective, as will usually be
the case, the test items should be arranged into clusters according to the
objectives being measured. Within each of these clusters of items, domain
scores may be estimated or mastery classifications made. Whatever the use
of the scores, appropriate reliability information should be reported on
each use of the scores derived from the test.




Figure 5.2.4    A schematic diagram depicting approaches
                to reliability assessment.

Estimation of Domain Scores
    Classical Model:   Standard Error of Measurement,  $SD\sqrt{1-r}$
    Binomial Model:    Binomial Error,  $\sqrt{\hat{\pi}(1-\hat{\pi})/n}$

Allocation to Mastery States
    Huynh's Method
    Subkoviak's Method
    Swaminathan et al. Method
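
The two error formulas named in Figure 5.2.4 for domain score estimation can be evaluated directly. The Python sketch below is illustrative only: the function names are hypothetical, and the binomial error is taken on the proportion-correct scale with n in the denominator, as read from the figure.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Classical model: SD * sqrt(1 - r), on the raw-score scale."""
    return sd * sqrt(1 - reliability)


def binomial_error(pi_hat, n_items):
    """Binomial model: sqrt(pi_hat * (1 - pi_hat) / n), proportion scale."""
    return sqrt(pi_hat * (1 - pi_hat) / n_items)


# Illustrative values: SD of 2 score points, reliability .75,
# and an examinee with domain score estimate .5 on a 4-item test.
print(standard_error_of_measurement(2.0, 0.75))
print(binomial_error(0.5, 4))
```

Note that the classical formula gives one error value for the whole group, while the binomial formula gives a separate value for each examinee's estimated domain score.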




5.3    Validity  of  Criterion-Referenced  Tests 


5.3.1    Introduction

While a number of topics addressed in these instructional materials
have been intensely studied by researchers in the criterion-referenced
field (for instance, observe the amount of work, albeit disjoint, that has
been done on cut-off scores), criterion-referenced test validity is not a
member of this highly-researched group. To date, it has remained a minimally
explored topic, which is surprising because of its importance. The
usefulness of any of the applications of criterion-referenced tests one could
name, e.g., to monitor individuals through objectives-based instructional
programs, to diagnose learning deficiencies, to evaluate educational and
social action programs, to assess competence on certification and licensing
examinations, etc., depends directly on the validity of the intended
interpretations of the criterion-referenced test scores. Why the lack of
validity information? One reason has been offered by Hambleton (1977).
He feels that most criterion-referenced test developers simply assume the
validity of scores for their intended uses, rather than establishing
validity in any formal fashion. Further, the scarce amount of work that
has been done has focused on the content validity of criterion-referenced
tests, rather than construct validity.

In the sections that follow, we will first clarify the issue of
content and construct validity. Then, we will discuss some procedures for
establishing content validity, followed by procedures for establishing
construct validity.




5.3.2    Clarification of Several Validity Issues

Cronbach (1971), Messick (1975), and Linn (1977) have argued
convincingly that to validate interpretations of criterion-referenced
test scores (i.e., to determine what is being measured), it is necessary
to proceed beyond a consideration of the content validity of a test.
Until recently, it was thought that content validity considerations of a
criterion-referenced test were sufficient. Messick (1975) states,

The major problem . . . is that content validity . . .
is focused upon test forms rather than test scores,
upon instruments rather than measurements. Inferences
in educational and psychological measurement are made
from scores, and scores are a function of subject
responses. Any concept of validity of measurement must
include reference to empirical consistency. Content
coverage is an important consideration in test
construction and interpretation, to be sure, but in
itself it does not provide validity. Call it 'content
relevance,' if you will, or 'content representativeness,'
but don't call it 'content validity' because
it doesn't provide evidence for the interpretation
of responses or scores.

Content validity is a test characteristic. It will not vary across
different groups of examinees or vary over time. However,
validity of test score interpretations will vary from one situation to
another. For example, if a criterion-referenced test is administered, perhaps
by mistake, under highly speeded testing conditions, the validity of
interpretations based on test scores obtained from the test administration
will be lower than if the test had been administered with more
suitable time limits. The content validity of the test does not change, but
the validity of any interpretation of the scores does (or at least, can)
change from one testing situation to another. Clearly then, content
validity evidence is not sufficient to establish "validity of test
score interpretations." We must consider construct validity to determine
the meaning of a set of scores. Answering this question will require
empirical analyses of test scores. In fairness to those who have
stressed the importance of content validity for criterion-referenced
tests at the expense of other types of validity, we should note that there is
a semantic problem. Although the content validity term was being used by many
workers in the criterion-referenced testing field, the intended meaning for
many was broader than the usual definition of content validity. Thus, many
kinds of empirical studies were being done by researchers under the
heading of content validity. For example, Rovinelli and Hambleton (1977)
discuss both the use of content specialists' ratings and empirical data
under the heading of content validity. It would have been
helpful to label the studies for what they are, i.e., construct
validation studies.

Perhaps at the risk of belaboring the point, we will repeat a
point which is stated frequently but appears to have been neglected
with criterion-referenced tests. It is that the concept of "validity of
measurement" refers to the scores and not to the test. Messick (1975)
notes,

One  validates,  not  a  test,  but  an  interpretation 
of  data  arising  from  a  specific  procedure. 

Linn (1977) makes the same point,

Questions of validity are questions of the soundness
of the interpretation of a measure. Thus, it is the
interpretation rather than the measure that is
validated. Measurement results may have many
interpretations which differ in their degree of validity
and in the type of evidence required for the
validation process.


The resolution of the validity question would seem to be this:
A content validity study is essential at the test development stage,
and the content validity of a criterion-referenced test will influence
the kind of test score interpretations that are possible. Also, it is
most important to conduct construct validation studies to validate the
intended use of the test scores. The nature of the construct
validation studies will depend on the intended use of the test scores.

In spite of its stated importance, it cannot be argued
that the nature of content validation studies with criterion-referenced
tests is well understood either. Guion (1977), for one, discusses many of the
problems surrounding the topic. It is only recently that any progress
(i.e., the development and field testing of content validation methods
with criterion-referenced tests) has been made (Millman, 1974; Popham, 1978;
Rovinelli and Hambleton, 1977).

In summary, content validation studies will address the matter
of content relevance of material that finds its way into a test.
Construct validation studies will relate to the matter of "meaning of
scores." These studies will include correlational and experimental
methods, as well as other methods of investigation.

Contributing to some confusion among test developers is the
proliferation of new validity terms. Domain validity, descriptive
validity, functional validity, domain-selection validity, and
incremental validity are but five. Our preference is to stay with the
standard terms, but define them clearly.




5.3.3    Content Validation Studies

Content validity has proven to be a fuzzy and confusing
concept, even for norm-referenced test developers. Some, like Guion (1977),
prefer the term "content representativeness" to "content validity"
because the former expression is more descriptive. For criterion-
referenced test developers, content validity refers to the matter of
how well an observed sample of behaviors reflects the larger domain of
behaviors included in a domain specification written to define an
objective. "Content" is broadly defined to include material from the
cognitive, affective, and psychomotor domains.

Generally speaking, the quality of criterion-referenced test items
can be determined by the extent to which they reflect, in terms of
their content, the domains from which they were derived. The problem
here is one of item validation; unless one can say with a high
degree of confidence that the items in a criterion-referenced test measure the
intended objectives, any use of the test score information is
questionable. When domain specifications are utilized, the domain definition
is never really precise enough to assume a priori that the items are
valid. Thus the quality of the items must be determined in a context
independent from the process by which the items were generated. This
is an a posteriori approach to item validation. Some procedures
have been designed to assess whether or not a direct relationship between
an item and a domain or objective exists through analysis of data
collected after the item is written (Hambleton & Fitzpatrick, in
preparation; Popham, 1978).

There are two approaches which may be used to establish the
(content) validity of test items. The first approach, and the approach
we feel holds the most merit, involves the judgment of test items by
content specialists. The judgments that are made concern the extent of
"match" between the test items and the domain they are designed to
measure. Questions asked of content specialists about content validity
of test items can be reduced to two important ones:

1.    Are the format and content of an item appropriate to measure
some part of the domain specification?

2.    Does  the  available  set  of  test  items  adequately  sample  a 
particular  domain? 

A second approach is to apply empirical techniques to examinee response
data in much the same way empirical techniques are applied in norm-referenced
test development. In fact, along with some recently developed
empirical procedures for criterion-referenced tests, several norm-referenced test item
statistics can (and should) be used. The problem is to ensure that
these statistics are used and interpreted correctly in the context of criterion-
referenced test development. Item statistics should be used to detect
aberrant items that need to be reworked, and not to make final decisions about
which items are to be included in a test. An excellent review
of item statistics for use with criterion-referenced tests has been prepared by
Berk (1978).




The first question is studied by comparing test items generated
by different content specialists and analyzing the judgments of content
specialists about items relative to the domain they were developed to
measure. The second question is a difficult one to investigate, unless
the domain of items is completely specified. The question can be investigated
by Cronbach's (1971) interesting but somewhat impractical duplication experiment.




5.3.4    Construct  Validation  Studies 

Content validity evidence does not address the matter of validity
of criterion-referenced test score interpretations. Content validity is
a characteristic of the test. Clearly it is essential to establish
validity of score interpretations, and therefore construct validation
studies are needed. We like Messick's (1975) definition of construct
validation:

Construct  validation  is  the  process  of  marshalling 
evidence  in  the  form  of  theoretically  relevant 
empirical  relations  to  support  the  inference  that 
an  observed  response  consistency  has  a  particular 
meaning  (p.  955). 

Messick (1975) offers several explanations for why construct
validation studies have not been more common in educational measurement.
For one, content validity of criterion-referenced tests was seen as
sufficient. Second, criterion-referenced test score distributions
are often homogeneous (for example, it often happens that before
instruction most individuals do poorly on a test, and after instruction,
most individuals do well). Correlational methods do not work very well
with homogeneous distributions of scores because of score range
restrictions. But, as Messick (1975) has noted,

construct validation is by no means limited
to correlation coefficients, even though it may
seem that way from the prevalence of correlation
matrices, internal consistency indices, and
factor analysis (p. 958).




Construct validation studies should begin with a definite
statement of the proposed interpretation. This will provide direction for
the kind of evidence that is worth collecting. Cronbach (1971, p. 483)
notes, "Investigations to be used for construct validation, then, should
be purposeful rather than haphazard." Later, when all of the data are
collected and analyzed, a final conclusion as to the validity of the
intended interpretation can be offered.

Let us next review some of the investigations that could be
conducted.


Guttman  Scalogram  Analysis 

It frequently occurs that objectives can be arranged linearly
or hierarchically. Guttman scaling is a relevant procedure
for the construct validation of criterion-referenced test items in
situations where the objectives can be organized into either a linear or
hierarchical sequence. To use Guttman's scalogram analysis as a
technique in an item validation methodology, one would first need to specify
the hierarchical structure of a set of objectives. To the extent that
examinee responses to the test items intended to measure objectives in
the hierarchy are predictable from a knowledge of the hierarchy, one
would have evidence to support the construct validity of the test items
as measures of the intended objectives. On the other hand, we should note
that in situations where examinee item responses are not predictable, one
of three situations has occurred:

1.  The hierarchy is incorrectly specified;

2.  The items are not valid measures of the intended objectives; or

3.  A combination of the two explanations.

More precise specifications for the utilization of Guttman
scaling will of course be needed before the method can be fully
implemented in the validation process for criterion-referenced test items.
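
One way to make "predictable from a knowledge of the hierarchy" concrete is to count how often observed response patterns depart from the patterns a perfect hierarchy would produce. The Python sketch below is only one possible convention and is not specified in the text: it assumes each examinee has a 0/1 mastery indicator per objective, ordered from the lowest to the highest objective in the hypothesized hierarchy, compares each pattern with the ideal pattern having the same total, and reports a coefficient of reproducibility (1 minus the proportion of responses in error). All names are hypothetical.

```python
def guttman_errors(pattern):
    """Count responses that differ from the ideal pattern with the same total.

    `pattern` lists 0/1 mastery indicators, lowest objective first. In a
    perfect hierarchy every mastered objective precedes every unmastered one.
    """
    total = sum(pattern)
    ideal = [1] * total + [0] * (len(pattern) - total)
    return sum(observed != expected for observed, expected in zip(pattern, ideal))


def coefficient_of_reproducibility(patterns):
    """1 minus the proportion of all responses counted as errors."""
    n_responses = sum(len(p) for p in patterns)
    n_errors = sum(guttman_errors(p) for p in patterns)
    return 1 - n_errors / n_responses


# Hypothetical data: three examinees, three hierarchically ordered objectives.
patterns = [[1, 1, 0], [1, 0, 1], [0, 0, 0]]
print(coefficient_of_reproducibility(patterns))
```

A low coefficient does not say which of the three explanations listed above applies; it only signals that the hierarchy and the items, taken together, do not fit the data.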



Factor  Analysis 

While factor analysis is a commonly employed procedure for the
dimensional analysis of items in a norm-referenced test, or of scores
derived from different norm-referenced tests, it has rarely, if ever,
been used in construct validation studies of criterion-referenced test
scores. Perhaps one reason for its lack of use is that the usual input
for factor analytic studies is correlations, and correlations are often
low between items on a criterion-referenced test, or between criterion-
referenced test scores and other variables, because score variability is
often not very great. However, the problem can be remedied by choosing
a sample of examinees with a wide range of ability. What is required is a
group that includes both masters and non-masters. The research problem, in the
language of factor analysis, becomes a problem of determining whether or not the
factor pattern matrix has a prescribed form. One would expect to obtain
as many factors in a factor solution as there are objectives covered in
a test, with items "loading" on only the factor (or objective) that
they were designed to measure. Items deviating from this pattern could
be carefully studied for flaws.

Similarly, scores from many criterion-referenced tests could be factor
analyzed and the resulting structure could be compared to some structure
specifying a theoretical relationship among the tests.
In addition, scores from other tests might be correlated to provide a
base for convergent and divergent studies.

Experimental Studies of Sources of Invalidity

There are many sources of error that can reduce the validity of
an intended interpretation of a test score. Suppose, for example, we
estimated an examinee to have an 80% level of performance on a test
measuring "ability to identify the main idea in paragraphs." Is 80% an accurate
assessment of the examinee's ability? We might ask about the influence of many
factors:

1.  How  clear  were  the  test  directions? 

2.  Was  there  any  confusion  in  using  the  answer  sheets? 

3.  Was  the  test  administered  under  speeded  testing  conditions? 

4.  Was  the  examinee  motivated? 

5.  Was  the  examinee  interested  in  the  content  of  the  paragraph? 

6.  Was  the  vocabulary  suitable? 

7.  What  role  did  test-taking  skills  play  in  the  examinee's  performance? 

8.  Was the item format suitable for measuring the desired skill?

9.  At  what  time  during  the  day  was  the  test  administered? 
10.    Were  the  physical  surroundings  suitable? 

To  the  extent  that  any  of  these  (and  many  other)  factors  influence  test 
scores,  the  usefulness  of  the  test  scores  is  reduced. 

Required  are  experimental  studies  of  potential  sources  of  error 
to  determine  their  effect  on  test  scores.    Results  of  these  studies  can 
be  used  to  further  clarify  domain  specifications.    For  example,  if  we 
discovered  item  format  influenced  test  scores,  we  could  include  in  the 
domain  specifications  which  item  type  should  be  used,  after  we  determined 
which  produced  the  most  construct  valid  test  scores. 



5.3.5    Summary

In this section of the instructional materials on criterion-referenced
tests, we have tried to clarify the present status of criterion-referenced
test validity. We have explicated the differences between content and
construct validity and have discussed procedures for establishing both
content and construct validity.



5.4    Norms  for  Interpreting  Criterion-Referenced  Test  Scores 

A question occasionally asked by criterion-referenced test users
is whether or not it makes sense to use norms with such tests. Will
the use of norms data enhance or erode the interpretability of a
criterion-referenced test? Popham (1976) has discussed this point, and
the discussion which follows contains many of his ideas.

For norm-referenced tests, where there is only a general
specification of the content area being addressed, test scores derive their
meanings through comparisons to norm group data. For criterion-referenced
tests, where clearly described domains of behaviors are specified, test
scores can derive their meaning by being referenced to this domain of
behaviors. The major difference is, then, in the ability of these two
types of tests to describe exactly what a student can do. In norm-
referenced testing, where there is only a general description of the
content area being measured, little can be said about what an individual
can do. In criterion-referenced testing, where the content area is
clearly defined, absolute statements about what an individual can do are
possible. The problem is that while an accurate description of what an
individual can do is useful, it is often not enough.

A  decision-maker  often  wants  to  know  more  than  what  a  student  can 
do;  he/she  wants  to  evaluate  the  observed  level  of  performance.  The 
test  performance  of  suitable  norm  groups  provides  an  excellent  basis 
upon  which  to  gain  additional  insights  into  what  should  constitute  an 
acceptable  level  of  test  performance.    For  example,  if  it  is  known  how 
well  a  group  of  students  performed  in  a  program  the  previous  year,  or 
how well students in a neighboring school district performed on a
particular test, it becomes possible to provide a framework for viewing
and interpreting new individual and group performance on the same test.

An expressed fear about using norms data with criterion-referenced
tests is that through the use of such data, criterion-referenced testing
procedures will somehow be forsaken for norm-referenced ones.  In other
words, the fear is that by adding norms, the procedures and interpretability
of criterion-referenced tests will be eroded.  This is not so;
the use of norms will not do anything to the descriptive quality of the test.
Rather, the use of norms will supplement the basic interpretations.  Test
scores can then be interpreted in an absolute fashion in
reference to the objectives and in a comparative fashion in reference to
the norms data.  The best one can hope for with a norm-referenced test is
test score interpretation on a comparative basis.  Hence, the fear of
eroding the basic nature of a criterion-referenced test by introducing
norms is unfounded.

One  fear  that  is  founded,  according  to  Popham  (1976),  is  that: 

.  .  .users  of  criterion-referenced  tests  will 
unthinkingly  rely  on  normative  data  as  a 
determiner  of  performance  standards. 

Rather  than  using  the  norms  data  as  a  sole  determiner  of  standards,  a 
number  of  other  viable  procedures  can  be  used.    The  reader  should  refer 
to  Unit  6  and  the  discussion  of  cut-off  scores. 

Given  that  the  use  of  norms  data  can  supplement  the  interpretability 
of  criterion-referenced  test  scores,  how  then  should  a  test  developer  go 
about the task of collecting and representing the norms data?  Little
that is new can be said about the collection of the data; the test should
be administered to a representative sample of students from the norm group
(or groups) of interest.  Of the numerous ways of presenting norms data,
one possibility is percentile ranks.  The percentile rank a student
receives is defined as the percentage of students in the norm or reference
group who score equal to or below the student's test score.  What follows
is a brief discussion of how to obtain percentile ranks.  The reader
should consult any of the standard test and measurement texts listed in
the reference section of Unit 2 for a more in-depth discussion of
percentile ranks.

One  popular  method  for  calculating  percentile  ranks  is  described 
next  and  a  simple  example  is  offered. 


Computation of Percentile Ranks

1.  Prepare a frequency distribution (f) of the scores.

2.  Find the cumulative frequency (CF), the number of persons
scoring lower than the score in question, by summing the
frequency (f) of scores below the score in question.

3.  Find the cumulative frequency to the mid-point (CFmp) by
adding one-half the number of scores in the interval to CF:

CFmp = CF + .5f

4.  Find the cumulative proportion (CP) by dividing CFmp by N,
the total number of scores.

5.  Multiply CP by 100 to obtain the percentile rank.


The following example, based on an N of 25, is offered to demonstrate
the steps listed above.  Usually a much larger sample size is used in
setting up a table of percentile ranks.  A single test score has too
much influence on the distribution of percentile ranks for small N.

Raw Score     f     CF     CFmp     CP     Percentile Rank

   10         2     23     24.0     .96          96
    9         3     20     21.5     .86          86
    8         7     13     16.5     .66          66
    7         6      7     10.0     .40          40
    6         4      3      5.0     .20          20
    5         2      1      2.0     .08           8
    4         1      0       .5     .02           2
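
The five computational steps can also be sketched in a few lines of Python. This is only an illustration: the function name percentile_ranks is ours, and the score frequencies are taken from the N = 25 example as best they can be read (the row for a raw score of 5 is assumed to have f = 2, so that the frequencies sum to 25).

```python
from collections import Counter


def percentile_ranks(scores):
    """Return {raw score: percentile rank} using the mid-point method."""
    n = len(scores)                      # N, the total number of scores
    freq = Counter(scores)               # step 1: frequency distribution (f)
    ranks = {}
    cf = 0                               # step 2: number scoring below the current score
    for score in sorted(freq):
        f = freq[score]
        cf_mp = cf + 0.5 * f             # step 3: CFmp = CF + .5f
        cp = cf_mp / n                   # step 4: cumulative proportion
        ranks[score] = round(100 * cp)   # step 5: percentile rank
        cf += f                          # CF for the next higher score
    return ranks


# Hypothetical reconstruction of the 25 scores in the example.
example_scores = [10] * 2 + [9] * 3 + [8] * 7 + [7] * 6 + [6] * 4 + [5] * 2 + [4] * 1
ranks = percentile_ranks(example_scores)
for score in sorted(ranks, reverse=True):
    print(score, ranks[score])
```

Because the rank for each score depends only on the frequencies at and below it, the same function can be applied unchanged to a much larger norm group.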




5.5  References 

Berk, R. A.  Determination of optimal cutting scores in criterion-
referenced measurement.  Journal of Experimental Education, 1976,
45, 4-9.

Berk, R. A.  Criterion-referenced test item analysis and validation.
Paper presented at the First Annual Johns Hopkins University
National Symposium on Educational Research, Washington, 1978.

Brennan, R. L., & Kane, M. T.  An index of dependability for mastery tests.
Journal of Educational Measurement, 1977, 14, 277-289.

Brennan, R. L., & Kane, M. T.  Signal/noise ratios for domain-referenced
tests.  Psychometrika, in press.

Carver, R. P.  Special problems in measuring change with psychometric
devices.  In Evaluative research: Strategies and methods.
Washington: American Institutes for Research, 1970.

Cohen, J.  A coefficient of agreement for nominal scales.  Educational and
Psychological Measurement, 1960, 20, 37-46.

Cohen, J.  Weighted kappa: Nominal scale agreement with provision for
scaled disagreement or partial credit.  Psychological Bulletin, 1968,
70, 213-220.

Cronbach, L. J.  Test validation.  In R. L. Thorndike (Ed.), Educational
measurement. (2nd ed.)  Washington: American Council on Education,
1971.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N.  The
dependability of behavioral measurements: Theory of generalizability for
scores and profiles.  New York: John Wiley & Sons, 1972.

Fleiss, J. L., Cohen, J., & Everitt, B. S.  Large sample standard errors
of kappa and weighted kappa.  Psychological Bulletin, 1969, 72,
323-327.

Guion, R. M.  Content validity: The source of my discontent.  Applied
Psychological Measurement, 1977, 1, 1-10.

Haladyna, T. M.  Effects of different samples on item and test
characteristics of criterion-referenced tests.  Journal of Educational
Measurement, 1974, 11, 93-99.

Hambleton, R. K.  Testing and decision-making procedures for selected
individualized instructional programs.  Review of Educational
Research, 1974, 44, 371-400.

Hambleton, R. K.  Validation of criterion-referenced test score
interpretations.  A paper presented at the Third International Symposium
on Educational Testing, University of Leyden, The Netherlands, 1977.


Hambleton, R. K., & Fitzpatrick, A.  Review techniques for criterion-
referenced test items.  Manuscript in preparation.

Hambleton, R. K., & Novick, M. R.  Toward an integration of theory and
method for criterion-referenced tests.  Journal of Educational
Measurement, 1973, 10, 159-170.

Hambleton, R. K., Swaminathan, H., & Algina, J.  Some contributions to the
theory and practice of criterion-referenced testing.  In D. N. M. de
Gruijter and L. J. Th. van der Kamp (Eds.), Advances in psychological
and educational measurement.  New York: Wiley, 1976.

Harris, C. W.  An interpretation of Livingston's reliability coefficient
for criterion-referenced tests.  Journal of Educational Measurement,
1972, 9, 27-29.

Huynh, H.  On the reliability of decisions in domain-referenced testing.
Journal of Educational Measurement, 1976, 13, 253-264. (b)

Huynh, H.  Reliability of multiple classification.  Psychometrika, 1978, 43, 31

Linn, R. L.  Issues of validity in measurement for competency-based
programs.  Paper presented at the annual meeting of the National
Council on Measurement in Education, New York, 1977.

Livingston, S. A.  Criterion-referenced applications of classical test
theory.  Journal of Educational Measurement, 1972, 9, 13-26. (a)

Livingston, S. A.  A reply to Harris' "An interpretation of Livingston's
reliability coefficient for criterion-referenced tests."  Journal
of Educational Measurement, 1972, 9, 31. (b)

Livingston, S. A.  Reply to Shavelson, Block and Ravitch's "Criterion-
referenced testing: Comments on reliability."  Journal of
Educational Measurement, 1972, 9, 139-140. (c)

Lord, F. M., & Novick, M. R.  Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Messick, S. A.  The standard problem: Meaning and values in measurement
and evaluation.  American Psychologist, 1975, 30, 955-966.

Millman, J.  Criterion-referenced measurement.  In W. J. Popham (Ed.),
Evaluation in education: Current applications.  Berkeley, California:
McCutchan Publishing Co., 1974.

Popham, W. J.  Normative data for criterion-referenced tests.  Phi Delta
Kappan, 1976, 58, 593-594.


Popham, W. J.  Criterion-referenced measurement.  Englewood Cliffs, NJ:
Prentice-Hall, 1978.

Rovinelli, R. J., & Hambleton, R. K.  On the use of content specialists
in the assessment of criterion-referenced test item validity.
Tijdschrift voor Onderwijsresearch, 1977, 2, 49-60.

Shavelson, R. J., Block, J. H., & Ravitch, M. M.  Criterion-referenced
testing: Comments on reliability.  Journal of Educational
Measurement, 1972, 9, 133-137.

Subkoviak, M.  Estimating reliability from a single administration of a
criterion-referenced test.  Journal of Educational Measurement,
1976, 13, 265-275.

Swaminathan, H., Hambleton, R. K., & Algina, J.  Reliability of criterion-
referenced tests: A decision-theoretic formulation.  Journal of
Educational Measurement, 1974, 11, 263-268.

Popham, W. J., & Husek, T. R.  Implications of criterion-referenced
measurement.  Journal of Educational Measurement, 1969, 6, 1-9.


Unit 6

Issues and Methods for Standard-Setting¹


Prepared By

Ronald K. Hambleton
University of Massachusetts, Amherst

and

Daniel R. Eignor
Educational Testing Service


March 15, 1979


¹Portions of material in this unit are from Hambleton and
Eignor (1979) and Eignor (197?) (see references).


Table of Contents

6.0   Overview of the Unit
6.1   Introduction
6.2   Some Issues in Standard Setting
      6.2.1  Uses of Cut-Off Scores in Decision-Making
6.3   Distinction Between Continuum and State Models
6.4   Traditional and Normative Procedures
6.5   Consideration of Several Promising Standard
      Setting Methods
6.6   Judgmental Methods
      6.6.1  Item Content
      6.6.2  Guessing and Item Sampling
6.7   Empirical Methods
      6.7.1  Data From Two Groups
      6.7.2  Decision-Theoretic Procedures
      6.7.3  Empirical Methods Depending Upon
             a Criterion Measure
      6.7.4  Educational Consequences
6.8   Combination Methods
      6.8.1  Judgmental-Empirical
      6.8.2  Bayesian Procedures
6.9   Some Procedural Steps in Standard Setting
      6.9.1  Preliminary Considerations
      6.9.2  Classroom Testing
      6.9.3  Basic Skills Testing for Annual Promotion
             and High School Graduation
      6.9.4  Professional Licensing/Certification Testing
6.10  Summary
6.11  References


6.0    Overview of the Unit

In this Unit, some of the issues involved in standard setting
along with methods for standard-setting are reviewed.  The review
will draw on the work of Millman, Meskauskas, and Glass and incorporate
many of the newer standard-setting methods.  The standard-setting
methods are organized into three categories: judgmental
methods, empirical methods, and combinations of judgmental and empirical
methods.  Procedures for setting standards to accomplish three
primary uses of criterion-referenced testing are discussed in a final
section of the paper.


6.1    Introduction

In a recent review of the criterion-referenced testing field,
Hambleton, Swaminathan, Algina, and Coulson (1978) delineated two
major uses for test scores derived from criterion-referenced tests:
domain score estimation and the allocation of examinees to mastery
states.  The second use, the allocation of examinees to mastery
states, requires the setting of a performance standard, or cut-off
score.

Based upon an individual's score on a test, where the test is
a representative sample of the subject domain, a mastery/non-mastery
decision concerning the domain from which the item sample was drawn
is sought.  Millman (1973) summarizes the situation well:

Of  interest  is  the  proportion  of  such  items  a 
student  can  pass.    It  is  assumed  that  some  edu- 
cational decision,  e.g.,  the  nature  of  subsequent 
instruction  for  the  student,  is  conditional  upon 
whether  or  not  he  exceeds  a  proficiency  standard 
when  administered  a  sample  of  items  from  the 
domain.    Thus,  attention  is  directed  toward  the 
individual  examinee  and  his  performance  relative 
to  the  standard  rather  than  toward  producing 
indicators  of  group  performance. 

Thus, it can be seen that in this criterion-referenced testing
situation, a cut-off score must be set in order to make a decision
about an individual's mastery status (there can be multiple cut-off
scores on the domain score scale, although usually only one is set).
The results of this decision will depend upon the context
within which the test is being used.  As an example, consider the
Mastery Learning paradigm (Block, 1972).  In this situation, if a
student's score exceeds the cutting score, he/she is advanced to
the next unit of instruction.  If the student's score falls below
the standard, remedial activities are prescribed.  It is important
to understand that the decision being made is on the level of the
individual, and as such, the status of other individuals does not
enter into the decision.  As a second example, consider the use of
criterion-referenced tests to provide test data relative to a set
of basic skills which students must demonstrate mastery of (i.e.,
achieve specified levels of performance) in order to graduate from
high school.  In this context, decisions are very important because
whether or not students can graduate will depend on their criterion-referenced
test score performance and the resulting mastery/non-mastery
decisions which are made.

These  situations  can  be  contrasted  with  the  setting  of  standards 
for norm-referenced tests, which is considerably less complex.  Since
an individual is compared to others on tests constructed to yield
norm-referenced interpretations, it makes sense to set a passing or
cut-off score so that a certain percent of the students pass.  If,
for instance, only 20% of the students taking an exam can be placed
in an enrichment program, then a passing score that passes 20% of the
students would make sense.
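
A quota-based passing score of this kind can be sketched as follows. This is only an illustration under our own assumptions: the function name quota_cut_score and the 25 scores are hypothetical, and because tied scores can make an exact 20% pass rate impossible, the sketch returns the lowest cut-off score whose pass rate does not exceed the quota.

```python
def quota_cut_score(scores, pass_fraction):
    """Lowest cut-off score whose pass rate (score >= cut) stays within the quota.

    Returns None when even the highest observed score would pass too many students.
    """
    n = len(scores)
    best = None
    for cut in sorted(set(scores), reverse=True):
        passers = sum(s >= cut for s in scores)
        # Small tolerance guards against floating-point error in pass_fraction * n.
        if passers > pass_fraction * n + 1e-9:
            break
        best = cut
    return best


# Hypothetical scores for 25 students on a 10-item test.
scores = [10] * 2 + [9] * 3 + [8] * 7 + [7] * 6 + [6] * 4 + [5] * 2 + [4] * 1
cut = quota_cut_score(scores, 0.20)
print(cut, sum(s >= cut for s in scores) / len(scores))
```

Note that the cut-off score obtained this way depends entirely on how the other students performed, which is exactly what distinguishes it from a criterion-referenced standard.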

Given  what  has  just  been  said  about  the  importance  of  cut-off 
scores  for  proper  criterion-referenced  test  score  usage,  one  would 
think that this would be a well-researched and documented area.  This
is simply not the case.  Most of the work done to date has been concerned
with the suggestion of possible methods, perhaps twenty-five
in number, rather than with actual empirical investigations.  In
addition to the individual work done, there have been two excellent
reviews of cut-score procedures advanced (Millman, 1973; Meskauskas,
1976), and one recent review that was highly critical of the field
(Glass, 1978a, 1978b).


6.2    Some Issues in Standard Setting

One  of  the  primary  purposes  of  criterion-referenced  testing  is 
to  provide  data  for  decision-making.    Sometimes  the  decisions  are  made 
by  classroom  teachers  concerning  the  monitoring  of  student  progress 
through  a  curriculum.    On  other  occasions,  promotion,  certification  and/or 
graduation decisions are made by school district and state administrators.

Glass (1978a) was rather critical of
measurement specialists for giving too little attention to the problem
of determining cut-off scores [he notes, "A common expression of
wishful thinking is to base a grand scheme on a fundamental, unsolved
problem." (p. 1)].  On the other hand, a considerable amount of criterion-referenced
testing research has been done.  Not all uses of criterion-referenced
tests require cut-off scores (for example, description), and moreover, the
problem does not really arise until a criterion-referenced test has
been constructed.  Also, it should not be forgotten that problems
associated with cut-off scores are difficult and so solutions are
going to require more time.


6.2.1  Uses of Cut-off Scores in Decision Making

A "cut-off score" is a point on a test score scale that is
used to "sort" examinees into two categories which reflect different
levels of proficiency relative to a particular objective measured by
a test.  It is common to assign labels such as "masters" and "non-masters"
to examinees assigned to the two categories.  It is not
unusual either to assign examinees to more than two categories based
on their test performance (i.e., sometimes multiple cut-off scores are
used) or to use cut-off scores that vary from one objective to
another (this may be done when it is felt that a set of objectives
differ in their importance).

It is important at this point to separate three types of
standards or cut-off scores.  Consider the following statement:

School  district  A  has  set  the  following  target— 
It  desires  to  have  85%  or  more  of  its  students 
in  the  second  grade  achieve  90%  of  the  reading 
objectives  at  a  standard  of  performance  equal 
to  or  better  than  80%. 

Three types of standards are involved in the example:

1.  The 80% standard is used to interpret examinee performance
on each of the objectives measured by a test.

2.  The 90% standard is used to interpret examinee performance
across all of the objectives measured by a test.

3.  The 85% standard is applied to the performance of second
graders on the set of objectives measured by a test.

In  this  unit,  only  the  first  use  of  standards  or  cut-off  scores  will 
be  considered. 
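
The way the three standards nest inside one another in the School district A statement can be sketched in Python. This is only an illustration: the function names and the data for 20 students are hypothetical, and each standard is read as "equal to or better than" the stated percentage.

```python
def objective_mastered(proportion_correct, item_standard=0.80):
    """Standard 1: performance on one objective is at or above 80%."""
    return proportion_correct >= item_standard


def student_on_target(objective_scores, item_standard=0.80, objective_standard=0.90):
    """Standard 2: the student achieves at least 90% of the objectives."""
    mastered = sum(objective_mastered(p, item_standard) for p in objective_scores)
    return mastered / len(objective_scores) >= objective_standard


def district_on_target(students, student_standard=0.85):
    """Standard 3: at least 85% of the students are on target."""
    on_target = sum(student_on_target(scores) for scores in students)
    return on_target / len(students) >= student_standard


# Hypothetical data: 20 second graders, 10 reading objectives each,
# with each entry the proportion of items answered correctly on one objective.
strong = [0.80] * 9 + [0.50]      # achieves 9 of the 10 objectives
weak = [0.50] * 10                # achieves none of the objectives
district = [strong] * 17 + [weak] * 3
print(district_on_target(district))
```

Only the first function involves the kind of cut-off score considered in this unit; the other two apply standards to results that already depend on it.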

In what follows it is important to separate the theoretical
arguments for or against the uses of cut-off scores from the uses and
misuses of cut-off scores in practical settings.  For example, it is well-known
that cut-off scores are often "pulled from the air" or set to (say) 80%
because that is the value another school district is using.  But
the fact that cut-off scores are being determined in a highly inappropriate
way is obviously not grounds for rejecting the concept of a
"cut-off score."  If the concept is appropriate for some particular
use of a criterion-referenced test, the task becomes one of training
people to set and to use cut-off scores properly (Hambleton, 1978).


Four questions with respect to the use of cut-off scores with
criterion-referenced tests require answers:

1.  Why are cut-off scores needed?

2.  What methods are available for setting cut-off scores?

3.  How should a method be selected?

4.  What guidelines are available for applying particular
methods successfully?


1.  Why are cut-off scores needed?
An answer to the question depends on the intended use (or uses)
of the test score information.  Consider first objectives- or competency-based
programs, since it is with these types of programs that criterion-referenced
tests and cut-off scores are most often used.  Objectives-based
programs, in theory, are designed to improve the quality of instruction
by (1) defining the curricula in terms of objectives, (2) relating
instruction and assessment closely to the objectives, (3) making
it possible for individualization of instruction, and (4) providing
for on-going evaluation.  Hard evidence on the success of objectives-based
programs (or most new programs) is in short supply, but there is
some evidence to suggest that when objectives-based programs are implemented
fully and properly they are better than more "traditionally-oriented"
curricula (Klausmeier, Rossmiller, & Saily, 1977; Torshen,
1977).  Individualization of instruction is "keyed" to descriptive
information provided by criterion-referenced tests relative to examinee
performance on test items measuring objectives in the curriculum.
But descriptive information such as "examinee A has answered correctly
85% of the test items measuring a particular objective" must be evaluated
and decisions made based upon that interpretation.  Has a student
demonstrated a sufficiently high level of performance on an objective
to lead to a prediction that she/he has a good chance of success on
the next objective in a sequence?  Does a student's performance level
indicate that he/she may need some remedial work?  Is the student's
performance level high enough to meet the target for the objective
defined by teachers of the curriculum?  In order to answer these and
many other questions it is necessary to set standards or cut-off scores.
How else can decisions be made?  Comparative statements about students
(for example, Student A performed better than 60% of her classmates)
are largely irrelevant.  Cut-off scores carefully developed by qualified
teams of experts can contribute substantially to the success of an
objectives-based program (competency-based program or basic skills
program) because cut-off scores provide a basis for effective decision-making.

There has also been criticism (Glass, 1978a) of the use of cut-off
scores with "life skills" or "survival skills" tests.  These are
terms currently popular with State Departments of Education, School
Districts, Test Publishers, and the press.  Of course, Glass is correct
when he notes that it would be next to impossible to validate the classifications
of examinees into "mastery states", i.e., those predicted to
be "successful" or "unsuccessful" in life.  On the other hand, if what
is really meant by the term "life skills" (say) is "graduation requirements,"
then standards of performance for "basic skills" or "high school
competency" tests can probably be set by appropriately chosen groups of
individuals (Millman, personal communication).


2.  If cut-off scores are needed, what
methods are available for setting them?


Numerous researchers have catalogued many of the available methods
(Hambleton & Eignor, 1979; Hambleton et al., 1978; Jaeger, 1976;
Millman, 1973; Meskauskas, 1976; Shepard, 1976).  Many of these methods
have also been reviewed by Glass (1978a).  It suffices to say here
that there exist methods based on a consideration of (1) item content,
(2) guessing and item sampling, (3) empirical data from mastery and
non-mastery groups, (4) decision-theoretic procedures, (5) external
criterion measures, and (6) educational consequences.  These methods
will be considered in detail in sections 6.6, 6.7, and 6.8.

What  is  clear  is  that  all  of  the  methods  are  arbitrary  and 
this  point  has  been  made  or  implied  by  everyone  whose  work  we  have 
had  an  opportunity  to  read.    The  point  is  not  disputed  by  anyone  we 
are  aware  of.    But  as  Glass  (1978a)  notes,  "arbitrariness  is  no  bogey- 
man, and  one  ought  not  to  shrink  from  a  necessary  task  because  it 
involves arbitrary decisions" (p. 42).  Popham (1978) has given an
excellent answer to the concern expressed by some researchers about
arbitrary standards:

Unable to avoid reliance on human judgment as
the chief ingredient in standard-setting, some
individuals have thrown up their hands in dismay
and cast aside all efforts to set performance
standards as arbitrary, hence unacceptable.

But Webster's Dictionary offers us two
definitions of arbitrary.  The first of these is
positive, describing arbitrary as an adjective
reflecting choice or discretion, that is, "determinable
by a judge or tribunal."  The second
definition, pejorative in nature, describes
arbitrary as an adjective denoting capriciousness,
that is, "selected at random and without reason."
In my estimate, when people start knocking the
standard-setting game as arbitrary, they are
clearly employing Webster's second, negatively loaded
definition.

But the first definition is more accurately
reflective of serious standard-setting efforts.
They represent genuine attempts to do a good job
in deciding what kinds of standards we ought to
employ.  That they are judgmental is inescapable.
But to malign all judgmental operations as capricious
is absurd.  (p. 168)

And, in fact, much of what we do is arbitrary in the positive sense of the
word.  We set fire standards, health standards, environmental standards,
highway safety standards (even standards for the operation of nuclear reactors),
and so on.  And in educational settings, it is clear that teachers make
arbitrary decisions about what to teach in their courses, how to teach
their material, and at what pace they should teach.  Surely, if teachers
are deemed qualified to make these other important decisions, they are
equally qualified to set standards or cut-off scores for the monitoring
of student progress in their courses.  But what if a cut-off score is
set too high (or low) or students are misclassified?  Through experience
with a curriculum, with high quality criterion-referenced tests, and
with careful evaluation work, standards that are not "in line" with
others can be identified and revised.  And for students who are misclassified,
there are some redeeming features.  Those that perform below the
standard will be assigned remedial work, and the fact that they performed below
the cut-off score suggests that they could not be too far above it (this
would be true for most of the students about whom false-negative errors
are made) and so the review period will not be a total waste of time.


Students who are misclassified because they scored above
a cut-off score will be tested again.  It is possible the next
time the error will be caught (particularly if the objectives are
sequential).  A comment by Ebel (1978) is particularly appropriate
at this point:

Pass-fail decisions on a person's achievement
in learning trouble some measurement specialists
a great deal.  They know about errors of measurement.
They know that some who barely pass do so
only with the help of errors of measurement.  They
know that some who fail do so only with the hindrance
of errors of measurement.  For these, passing or
failing does not depend on achievement at all.  It
depends only on luck.  That seems unfair, and indeed
it is.  But, as any measurement specialist can explain,
it is also entirely unavoidable.  Make a better test
and we reduce the number who will be passed or failed
by error.  But the number can never be reduced to
zero.  (p. 549)

The consequences of false-positive and false-negative errors
with basic skills assessment or high school certification tests are,
however, considerably more serious and so more attention must be given
to the design of these testing programs (for example, content covered
by the tests, the timing of tests, and decisions made with the test
results).  Considerably more effort must also be given to test development,
content validation, and setting of standards.


3.  How should a method be selected?

There are many factors to consider in selecting a method to
determine cut-off scores.  For example,

1.  How important are the decisions?

2.  How much time is available?

3.  What resources are available to do the job?

4.  How capable are the appropriate individuals of applying
a particular method successfully?

The most interesting work we have seen to date regarding the
selection of a method was offered by Jaeger (1976). He considers
several methods for determining cut-off scores, several approaches
for assigning examinees to mastery states, and various threats to the
validity of assignments. While Jaeger's work is theoretical, it provides
an excellent starting point for anyone interested in initiating research
on the merits of different methods. One thing seems clear from his
work: all of the methods he studied appear to have numerous potential
drawbacks, and so the selection of a method in a given situation should
be made carefully.


4.  What guidelines are available for applying particular methods
successfully?

Unfortunately, there are relatively few sets of guidelines
available for applying any of the methods. In our judgment, Zieky and
Livingston (1977) have provided a very helpful set of guidelines for
applying several methods (the popular Nedelsky method and the Angoff
method are two of the methods included). Some new work by Popham (1978)
is also very helpful. More materials of this type and quality are
needed. Some procedural steps for standard-setting with respect to three
important uses of tests are provided in section 6.9: (1) daily classroom
assessment, (2) basic skills assessment for yearly promotions and high
school certification, and (3) professional licensing and certification.


6.3    Distinction Between Continuum and State Models

The  basic  difference  between  continuum  and  state  models  has  to  do 
with  the  underlying  assumption  made  about  ability.    According  to  Meskauskas, 
two  characteristics  of  continuum  models  are: 

1.  Mastery  is  viewed  as  a  continuously  distributed  ability  or  set 
of  abilities. 

2.  An  area  is  identified  at  the  upper  end  of  this  continuum,  and 
if  an  individual  equals  or  exceeds  the  lower  bound  of  this 
area,  he/she  is  termed  a  master. 



State models, rather than being based on a continuum of mastery, view
mastery as an all-or-none proposition (i.e., either you can do something
or you cannot). Three characteristics of state models are:

1.  Test  true-score  performance  is  viewed  as  an  all-or-nothing 
state. 


2.  The  standard  is  set  at  100%. 

3.  After  a  consideration  of  measurement  errors,  standards  are 
often  set  at  values  less  than  100%. 

There are at least three methods for setting standards that are
built on a state model conceptualization of mastery. The models take
into account measurement error, deficiencies of the examination, etc.,
in "tempering" the standard from 100%. These methods have been referred
to by Glass (1978a) in his review of methods for setting standards as
"counting backwards from 100%." State model methods advanced to date
include the mastery testing evaluation model of Emrick (1971), the
true-score model of Roudabush (1974), and some recently advanced
statistical models of Macready and Dayton (1977). However, since state
models are somewhat less useful than continuum models in elementary
and secondary school testing programs, they will not be considered
further here. Our failure to consider them further, however, should
not be interpreted as a criticism of this general approach to
standard-setting. The approach seems to be especially applicable with
many performance tests (Hambleton & Simon, in preparation).



6.4    Traditional  and  Normative  Procedures 

Before discussing the various continuum models of standard-setting,
two other models for standard-setting should be mentioned. These
methods, which seem to have limited value in setting standards, have
been referred to by a variety of names. We will call them "traditional
standards" and "normative standards."

Traditional  standards  are  standards  that  have  gained  acceptance 
because  of  their  frequent  use.    Classroom  examples  include  the  decision 
that  90  to  100  percent  is  an  A,  80  to  89  percent  is  a  B,  etc.    It  appears 
that such methods have been used occasionally in setting standards.

"Normative" standards refer to any of three different uses of
normative data, two of which are, at best, questionable. In the first
method, use is made of the normative performance of some external
"criterion" group. As an example, Jaeger (1978) cites the use of the
Adult Performance Level (APL) tests by Palm Beach County, Florida schools.
Test performance of groups of "successful" adults was used to set
standards for high school students. Such a procedure can be
criticized on a number of grounds. Jaeger (1978) points out that
society changes, and that standards should also change. Standards
based on adult performance may not be relevant to high school students.
Shepard (1976) points out that any normatively-determined standard will
immediately result in a multitude of counterexamples. Further, Burton
(1978) suggests that relationships between skills in school subjects
and later success in life are not readily determinable; hence, observing
the degree of achievement on the test of some "successful" norm group
makes little sense. Jaeger (1978) goes on to say: "There
are no empirically tenable survival standards on school-based skills
that can be justified through external means."

A  second  way  of  proceeding  with  normative  data  is  to  make  a 
decision  about  a  standard  based  solely  on  the  distribution  of  scores 
of  examinees  who  take  the  test.    Such  a  procedure  circumvents  the 
"minimum  test  score  for  success  in  life"  problem,  but  the  procedure 
is  still  not  useful  for  setting  standards.    For  example,    Glass  (1978a) 
cites  the  California  High  School  Proficiency  Examination,  where  the  50th 
percentile  of  graduating    seniors  served  as  the  standard.    What  can 
be  said  of  a  procedure  where  whether  or  not  an  individual  passes  or 
fails  a  minimum  competency  test  depends  upon  the  other  individuals 
taking  the  test?    In  the  California  situation,  the  standard  was  set 
with  no  reference  at  all  to  the  content  of  the  test  or  the  difficulty 
of  the  test  items. 

The third use of normative data discussed in the literature
concerns the supplemental use of normative data in setting a standard.
Shepard (1976), Jaeger (1978), and Conaway (1976, 1977) all favor such
a procedure. Recently Jaeger (1978) advanced a standard-setting method which
requires judges to make judgments partially on the basis of item content.
In his method, Jaeger calls for incorporation of some tryout test data
to aid judges in reconsidering their initial assessments. Shepard
(1976) makes the following point:

Expert judges ought to be provided with normative
data in their deliberations. Instead of relying
on their experience, which may have been with
unusual students or professionals, experts ought to
have access to representative norms. . . of course,
the norms are not automatically the standards.
Experts still have to decide what "ought" to be, but
they can establish more reasonable expectations
if they know what current performance is than if
they deliberate in a vacuum.

We agree with Jaeger, Conaway, and Shepard about the usefulness
of normative data when used in conjunction with a standard-setting
method.



6.5    Consideration  of  Several  Promising 
Standard  Setting  Methods 

Remaining methods for setting standards to be discussed in this unit
assume that domain score estimates derived from criterion-referenced tests
are on a continuous scale (hence, the methods fall under the heading of
"Continuum Model"). For convenience, the methods under discussion are
organized into three categories. The methods are presented in Figure
6.5.1. The categories are labelled "judgmental," "empirical," and
"combination." In judgmental methods, data are collected from judges for
setting standards, or judgments are made about the presence of variables
(for example, guessing) that would affect the placement of a standard.
Empirical methods require the collection of examinee response data to aid
in the standard-setting process. Combination methods, not surprisingly,
incorporate judgmental data and empirical data into the standard-setting
process.




Figure 6.5.1  A classification of methods for setting standards²


Judgmental Methods

    Item Content
        Nedelsky (1954)
        Modified Nedelsky (Nassif, 1978)
        Angoff (1971)
        Modified Angoff (ETS, 1976)
        Ebel (1972)
        Jaeger (1978)

    Guessing
        Millman (1973)


Combination Methods¹

    Judgmental-Empirical
        Contrasting Groups (Zieky and Livingston, 1977)
        Borderline Groups (Zieky and Livingston, 1977)

    Educational Consequences
        Block (1972)

    Bayesian Methods
        Hambleton and Novick (1973)
        Novick, Lewis, Jackson (1973)
        Schoon, Gullion, Ferrara (1978)


Empirical Methods¹

    Data: Two Groups
        Berk (1976)

    Data: Criterion Measure
        Livingston (1975)
        Livingston (1976)
        Huynh (197S)
        Van der Linden and Mellenbergh (1977)

    Decision-Theoretic
        Kriewall (1972)


¹ Involve the use of examinee response data.
² From a paper by Hambleton and Eignor (1979).



6.6    Judgmental  Methods 
6.6.1    Item Content

In this situation, individual items are inspected, with the level
of concern being how the minimally competent person would perform on
the items. In other words, a judge is asked to assess how or to what
degree an individual who could be described as minimally competent would
perform on each item. It should be noted before describing particular
procedures utilizing this criterion that while this is a good deal more
objective than setting standards based on any of the methods previously
discussed, a considerable degree of subjectivity still exists. Six
procedures based on item content assessment will now be discussed.

i.  Nedelsky Method

In Nedelsky's method, judges are asked to view each question in a
test with a particular criterion in mind. The criterion for each question
is: which of the response options should the minimally competent student
(Nedelsky calls them "D-F students") be able to eliminate as incorrect?
The minimum passing level (MPL) for that question then becomes the
reciprocal of the number of remaining alternatives. For instance, if on a
five-alternative multiple choice question a judge feels that a minimally
competent person could eliminate two of the options, then three
alternatives remain and, for that question, MPL = 1/3. The judges proceed
with each question in a like fashion and, upon completion of the judging
process, sum the values for each question to obtain a standard on the
total set of test items. Next, the individual judges' standards are
averaged. The average is denoted $\pi_o$.

Nedelsky felt that if one were to compute the standard deviation of
the individual judges' standards, this distribution would be synonymous
with the (hypothesized or theoretical) distribution of the scores of the
borderline students. This standard deviation, $\sigma$, could then be
multiplied by a constant $K$, decided upon by the test users, to regulate
how many (as a percent) of the borderline students pass or fail. The
final formula for the adjusted standard then becomes:

$$\pi_o' = \pi_o + K\sigma$$

How does the $K\sigma$ term work? Assuming an underlying normal
distribution, if one sets K = 1, then 84% of the borderline examinees
will fail. If K = 2, then 98% of these examinees will fail. If K = 0,
then 50% of the examinees on the borderline should fail. The value for K
is set by (say) a committee prior to the examination.
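The percentages quoted for K = 0, 1, and 2 are values of the standard normal cumulative distribution function. The short sketch below reproduces them with Python's standard library; the function name is illustrative and not part of the original method.

```python
from statistics import NormalDist


def percent_borderline_failing(k: float) -> float:
    """Percent of borderline examinees expected to score below the
    adjusted standard (mean + k * sd), assuming normally distributed scores."""
    return 100.0 * NormalDist().cdf(k)


if __name__ == "__main__":
    for k in (0, 1, 2):
        print(f"K = {k}: about {percent_borderline_failing(k):.0f}% fail")
```

Negative values of K can be passed in the same way to see how a more lenient standard would behave under the same normality assumption.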

The final result of the application of Nedelsky's method will be
an  absolute  standard.    This  is  because  the  standard  is  arrived  at  without 
consideration  of  the  score  distributions  of  any  reference  group.    In  fact, 
the  standard  is  arrived  at  prior  to  using  the  test  with  the  group  one  is 
concerned  with  testing. 

The  following  example  is  included  to  demonstrate  how  the  Nedelsky 
method  can  be  applied  in  a  criterion-referenced  testing  situation. 

Example:  Suppose five judges were asked to score, using the Nedelsky
method, a six question criterion-referenced test made up of questions
that have five response options each. Further, suppose the judges agreed
that they would like 84% of the "D-F" or minimally competent students to
fail (i.e., they set K = +1). The calculations below show the steps
necessary to calculate a cut-off score for the test.




                        Test Item                      Cut-Off Score
Judge      1      2      3      4      5      6       from Each Judge

  A       .25    .33    .25    .25    .00    .33           1.41
  B       .25    .50    .25    .50    .25    .33           2.08
  C       .33    .33    .25    .33    .25    .33           1.82
  D       .25    .33    .25    .33    .25    .33           1.74
  E       .00    .50    .25    .33    .00    .25           1.33

Average Cut-Off Score (Across Five Judges):

$$\frac{1.41 + 2.08 + 1.82 + 1.74 + 1.33}{5} = 1.68$$

Standard Deviation of the Cut-Off Scores:

$$\sqrt{\frac{(1.41-1.68)^2 + (2.08-1.68)^2 + \cdots + (1.33-1.68)^2}{5}} = \sqrt{\frac{.380}{5}} = .28$$

Adjusted Cut-Off Score (84% of Borderline Students to Fail):

$$1.68 + 1 \times .28 = 1.96$$


Therefore, approximately two test items out of six is the cut-off
score on this test. From a practical standpoint, this value would seem
low, but the data were created to demonstrate the process and not to model
a real testing situation. Therefore, no practical significance should be
attached to the answer.
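The arithmetic of the example can be checked with a short script. This is only a sketch: the judge ratings are those of the example, the function and variable names are illustrative, and the standard deviation is assumed to be the population form (dividing by the number of judges), as in the worked calculation.

```python
from statistics import mean, pstdev

# Minimum passing levels assigned by each judge to the six items
# (values taken from the example above).
RATINGS = {
    "A": [0.25, 0.33, 0.25, 0.25, 0.00, 0.33],
    "B": [0.25, 0.50, 0.25, 0.50, 0.25, 0.33],
    "C": [0.33, 0.33, 0.25, 0.33, 0.25, 0.33],
    "D": [0.25, 0.33, 0.25, 0.33, 0.25, 0.33],
    "E": [0.00, 0.50, 0.25, 0.33, 0.00, 0.25],
}


def nedelsky_cutoff(ratings, k):
    """Return (average standard, sd across judges, adjusted standard)."""
    judge_totals = [sum(items) for items in ratings.values()]
    average = mean(judge_totals)
    sd = pstdev(judge_totals)  # population SD: divide by number of judges
    return average, sd, average + k * sd


if __name__ == "__main__":
    average, sd, adjusted = nedelsky_cutoff(RATINGS, k=1)
    print(f"average = {average:.3f}, sd = {sd:.3f}, adjusted = {adjusted:.3f}")
```

Because the script carries unrounded intermediate values, its adjusted standard differs in the second decimal place from the hand calculation, which rounds the average and the standard deviation before adding them.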




ii.  Modified  Nedelsky 

Nassif (1978), in setting standards for the competency-based teacher
education and licensing systems in Georgia, utilized a modified Nedelsky
procedure. A modification of the Nedelsky method was needed to handle
the volume of items in the program. In the modified Nedelsky task, the
entire item (rather than each distractor) is examined and classified in
terms of two levels of examinee competence. The following question was
asked about each item: "Should a person with minimum competence in the
teaching field be able to answer this item correctly?" Possible answers
were "yes," "no," and "I don't know." Agreement among judges can be
studied through a simple comparison of the ratings judges give to each
item. A standard may be obtained by computing the average number of "yes"
responses judges give to the entire set of test items.
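A minimal sketch of that last computation follows. The ratings are hypothetical (three judges, five items) and the function name is an assumption made for illustration; only the rule of averaging the number of "yes" responses across judges comes from the description above.

```python
def modified_nedelsky_standard(ratings_by_judge):
    """Average, across judges, of the number of items rated 'yes'."""
    yes_counts = [
        sum(1 for rating in ratings if rating == "yes")
        for ratings in ratings_by_judge
    ]
    return sum(yes_counts) / len(yes_counts)


# Hypothetical ratings: one list per judge, one entry per test item.
HYPOTHETICAL_RATINGS = [
    ["yes", "yes", "no", "yes", "don't know"],
    ["yes", "no", "no", "yes", "yes"],
    ["yes", "yes", "yes", "yes", "no"],
]

if __name__ == "__main__":
    standard = modified_nedelsky_standard(HYPOTHETICAL_RATINGS)
    print(f"standard = {standard:.2f} items out of 5")
```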




iii.  Ebel's  Method 

Ebel (1972) goes about arriving at a standard in a
somewhat different manner, but his procedure is also based upon the test
questions rather than an "outside" distribution of scores. Judges are asked
to rate items along two dimensions: relevance and difficulty. Ebel uses four
categories of relevance: essential, important, acceptable, and questionable. He
uses three difficulty levels: easy, medium, and hard. These categories then form
(in this case) a 3 x 4 grid. The judges are next asked to do two things:

1.  Locate each of the test questions in the proper cell, based upon
relevance and difficulty.

2.  Assign a percentage to each cell, that percentage being the percentage
of items in the cell that the minimally-qualified examinee should be
able to answer.

Then the number of questions in each cell is multiplied by the appropriate
percentage (agreed upon by the judges), and the sum of all the cells, when
divided by the total number of questions, yields the standard.

The example that follows is modeled after an example offered by
Ebel (1972).

Example:  Suppose that for a 100 item test, five judges came to the
following agreement on percentage of success for the minimally qualified candidate.

                         Difficulty Level

Relevance           Easy       Medium       Hard

Essential           100%*       80%
Important            90%        70%
Acceptable           90%        40%         30%
Questionable         70%        50%         20%

*The expected percentage of passing for items in the category.


Combining these data with the judges' location of test questions in
the particular cells would yield a table like the following:

Item                 Number of     Expected      Number x
Category              Items*       Success       Success

ESSENTIAL
  Easy                   85           100           8500
  Medium                 55            80           4400

IMPORTANT
  Easy                  123            90          11070
  Medium                103            70           7210

ACCEPTABLE
  Easy                   21            90           1890
  Medium                 43            40           1720
  Hard                   50            30           1500

QUESTIONABLE
  Easy                    2            70            140
  Medium                  8            50            400
  Hard                   10            20            200

TOTAL                   500                        37030

Standard = 37030 / 500 = 74

*The number of items placed in each category by all five of the judges.
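The weighted-average computation in this example can be sketched as follows. The cell counts and percentages are those of the table; the data layout and function name are illustrative assumptions.

```python
# (relevance, difficulty, number of items, expected percent success)
# Cell values are taken from the example table above.
EBEL_CELLS = [
    ("essential", "easy", 85, 100),
    ("essential", "medium", 55, 80),
    ("important", "easy", 123, 90),
    ("important", "medium", 103, 70),
    ("acceptable", "easy", 21, 90),
    ("acceptable", "medium", 43, 40),
    ("acceptable", "hard", 50, 30),
    ("questionable", "easy", 2, 70),
    ("questionable", "medium", 8, 50),
    ("questionable", "hard", 10, 20),
]


def ebel_standard(cells):
    """Sum of (items x expected success) divided by the total item count."""
    weighted_sum = sum(count * percent for _, _, count, percent in cells)
    total_items = sum(count for _, _, count, _ in cells)
    return weighted_sum / total_items


if __name__ == "__main__":
    print(f"standard = {ebel_standard(EBEL_CELLS):.2f} percent")
```

Changing the percentages or regrouping the items into different cells shows directly how sensitive the resulting standard is to the judges' choices.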


Three comments can be made about Ebel's method that should be sufficient
to suggest caution when using it. One, Ebel offers no prescription for the
number or type of descriptions to be used along the two dimensions. This
is left to the judgment of the individuals judging the items. It is
likely that a different set of descriptions applied to the same test
would yield a different standard. Two, the process is based upon the
decisions of judges, and while the standard could be called absolute, in that
it is not referenced to a score distribution, it can't be called an
"objective" standard. Three, a point about Ebel's method has been offered by
Meskauskas (1976):

In Ebel's method, the judge must simulate the decision
process of the examinee to obtain an accurate judgment
and thus set an appropriate standard. Since the judge
is more knowledgeable than the minimally-qualified
individual, and since he is not forced to make a decision
about each of the alternatives, it seems likely that the
judge would tend to systematically over-simplify the
examinee's task . . . Even if this occurs only occasionally,
it appears likely that, in contrast to the Nedelsky method,
the Ebel method would allow the raters to ignore some of
the finer discriminations that an examinee needs to make
and would result in a standard that is more difficult to
reach. (p. 138)


iv.  Angoff's  Method 

When  using  Angoff's  technique,  judges  are  asked  to  assign  a  probability 
to  each  test  item  directly,  thus  circumventing  the  analysis  of  a  grid  or  the 
analysis  of  response  alternatives.    Angoff  (1971)  states: 

.  .  .ask  each  judge  to  state  the  probability  that  the 
'minimally  acceptable  person'  would  answer  each  item 
correctly.    In  effect,  the  judges  would  think  of  a 
number  of  minimally  acceptable  persons  instead  of  only 
one  such  person,  and  would  estimate  the  proportion  of 
minimally  acceptable  persons  who  would  answer  each  item 
correctly. The sum of these probabilities, or
proportions, would then represent the minimally acceptable
score. (p. 515)
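As a rough illustration of Angoff's procedure, the sketch below sums each judge's item probabilities. The probabilities are hypothetical, and averaging the judges' sums into a single standard is an assumption made here by analogy with the Nedelsky procedure; Angoff's statement above specifies only the sum.

```python
from statistics import mean

# Hypothetical probabilities that a minimally acceptable person answers
# each of five items correctly; one list per judge.
HYPOTHETICAL_PROBABILITIES = [
    [0.9, 0.6, 0.5, 0.8, 0.7],
    [0.8, 0.5, 0.6, 0.7, 0.9],
    [1.0, 0.5, 0.5, 0.5, 0.5],
]


def angoff_judge_scores(probabilities_by_judge):
    """Minimally acceptable score implied by each judge (sum of probabilities)."""
    return [sum(probabilities) for probabilities in probabilities_by_judge]


def angoff_standard(probabilities_by_judge):
    """Assumed pooling rule: average the judges' minimally acceptable scores."""
    return mean(angoff_judge_scores(probabilities_by_judge))


if __name__ == "__main__":
    print(angoff_judge_scores(HYPOTHETICAL_PROBABILITIES))
    print(f"pooled standard = {angoff_standard(HYPOTHETICAL_PROBABILITIES):.2f}")
```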


v.  Modified  Angoff 

ETS (1976) utilized a modification of Angoff's method for setting
standards. Based on the rationale that the task of assigning
probabilities may be overly difficult for the items to be assessed
(National Teacher Exams), Educational Testing Service instead supplied
a seven point scale on which certain percentages were fixed. Judges were
asked to estimate the percentage of minimally knowledgeable examinees
who would know the answer to each test item. The following scale was
offered:

5        20        40        60        75        90        95        DNK

where "DNK" stands for "Do Not Know."

ETS has also used scales with the fixed points at somewhat different
values; the scales are consistent, though, in that seven choice points
are given. For the Insurance Licensing Exams, 60 was used as the center
point, since the average percent correct on past exams centered around
60%. The other options were then spaced on either side of 60.


vi.    Jaeger's  Method 

Jaeger (1978) recently presented a method for standard-setting on the
North Carolina High School Competency Test. Jaeger's method incorporates
a number of suggestions made by participants in a 1976 NCME annual meeting
symposium presented in San Francisco by Stoker, Jaeger, Shepard, Conaway,
and Haladyna; it is iterative, uses judges from a variety of
backgrounds, and employs normative data. Further, rather than asking a
question involving "minimal competence," a term which is hard to
operationalize and conceptualize, Jaeger's questions are instead:

"Should every high school graduate be able to answer
this item correctly?"  ___ Yes, ___ No.

and

"If a student does not answer this item correctly,
should he/she be denied a high school diploma?"
___ Yes, ___ No.

After  a  series  of  iterative  processes  involving  judges  from  various  areas 
of  expertise,  and  after  the  presentation  of  some  normative  data, 
standards         determined  by  all  groups  of  judges  of  the  same  type  are 
pooled,  and  a  median  is  computed  for  each  type  of  judge.    The  minimum 
median  across  all  groups  is  selected  as  the  standard. 
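The final pooling step can be sketched as below. The judge types and their recommended standards are hypothetical, invented only to show the rule of taking a median within each type of judge and then the minimum of those medians; the iterative rating rounds and the presentation of normative data are not modeled.

```python
from statistics import median

# Hypothetical recommended standards (number of items correct),
# pooled by type of judge after the final rating round.
HYPOTHETICAL_STANDARDS = {
    "teachers": [30, 32, 35],
    "parents": [28, 31, 33],
    "employers": [34, 36, 29],
}


def jaeger_standard(standards_by_judge_type):
    """Minimum, across judge types, of the median recommended standard."""
    medians = {
        judge_type: median(values)
        for judge_type, values in standards_by_judge_type.items()
    }
    return min(medians.values()), medians


if __name__ == "__main__":
    standard, medians = jaeger_standard(HYPOTHETICAL_STANDARDS)
    print(medians)
    print(f"standard = {standard}")
```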


Comparisons  Among  Judgmental  Models 

We are aware of two studies that compare judgmental methods of
setting standards; one study was done in 1976, the other is
presently underway at ETS.

In 1976, Andrew and Hecht carried out an empirical comparison of the
Nedelsky and Ebel methods. In that study, judges met on two separate
occasions to set standards for a 180 item, four options per item, exam
to certify professional workers. On one occasion the Nedelsky method
was used. On a second occasion the Ebel method was used. The percentage
of test items that should be answered correctly by a minimally competent
examinee was set at 69% by the Ebel method and at 46% by the Nedelsky
method.



Glass (1978a) described the observed difference as a "startling finding."
Our view is that since directions to the judges were different, and
procedures differed, we would not expect the results from these two
methods to be similar. The authors themselves report:

It is perhaps not surprising that two procedures
which involve different approaches to the
evaluation of test items would result in different
examination standards. Such examination standards
will always be subjective to some extent and will
involve different philosophical assumptions and
varying conceptualizations. (p. 49)

Ebel  (1972)  makes  a  similar  point: 

. . . It is clear that a variety of approaches can
be used to solve the problem of defining the
passing score. Unfortunately, different approaches are
likely to give different results. (p. 496)

Possibly the most important result of the Andrew-Hecht study
was the high level of agreement in the determination of a standard
using the same method across two teams of judges. The difference was
not more than 3.4% within each method. Data of this kind address a
concern raised by Glass (1978a) about whether judges can make
determinations of standards consistently and reliably. At least in this
one study, it appears that they could.

From our interactions with staff at ETS who conduct teacher workshops
on setting standards, we have learned that teams of teachers working
with a common method obtain results that are quite similar. And this
result holds across tests in different subject matter areas and at
different grade levels. We have observed the same result in our own
work. Of course, certain conditions must be established if agreement
among judges is to be obtained. Essentially, it is necessary that the
judges share a common definition of the "minimally competent" student
and fully understand the rating process they are to use.


6.6.2  Guessing  and  Item  Sampling 


In  this  section,  some  concerns  initially  expressed  by  Millman  (1973) 
about  errors  due  to  guessing  and  item  sampling  will  be  discussed. 

If the test items allow a student to answer questions correctly by
guessing, a systematic error is introduced into student domain score
estimates. There are three possible ways to rectify this situation:

1.  The cut-off score can be raised to take into account the
contribution expected from the guessing process.

2.  A student's score can be corrected for guessing and then the
adjusted score compared to the performance standard.

3.  The test itself can be constructed to minimize the guessing process.

Methods one and two assume that guessing is of a pure, random nature,
which is not likely to be the case for criterion-referenced tests. Thus,
adjusting either the cutting score or the student's scores will probably
prove to be inadequate. The test must be structured to keep guessing to a
minimum, because if it occurs, it can't be adequately corrected for.

Also, if because of problems of test construction, inconvenience of
administration, or a host of other problems, the test is not representative
of the content of the domain, then Millman (1973) suggests that the cutting
score or standard be raised (or lowered) an amount to protect against
misclassification of students; i.e., false-positive and false-negative
errors. Millman offers no methods for determining the extent or direction
of correction for these problems. We feel that the test practitioner should
exert extra effort to assure that the problem just discussed doesn't occur
in the first place. Once again, there doesn't appear to be an adequate
method for "correcting away" the problem.




6.7    Empirical  Methods 

6.7.1  Data  From  Two  Groups 

Berk (1976) presented a method for setting cut-off scores that is
based on empirical data. He selects empirically the optimal cutting
score for a test based upon test data from two samples of students, one
of which has been instructed on the material, and the other uninstructed.
Before discussing his methodology, where he offers three ways of proceeding
based upon the data collected, it is worth discussing why he chose to
formulate his model in the first place. He suggests that the extant
approaches of a nature similar to his, namely those based on the binomial
distribution and those based upon Bayesian decision theory, suffer from
a deficiency. According to Berk:

The  fundamental  deficiency  of  all  of  these  methods 
is  their  failure  to  define  mastery  operationally 
in  terms  of  observed  student  performance,  the 
objective  or  trait  being  measured,  and  item  and 
test  characteristics.    The  criterion  level  or 
cutting  score  is  generally  set  subjectively  on 
the  basis  of  "judgment"  or  "experience"  and  the 
probabilities of Type I/Type II classification
errors  associated  with  the  criterion  are  estimated. 

One of Berk's procedures considers false-positive and false-negative
errors, but the difference is that the results are based upon actual data.
Berk offers three ways of approaching the problem of setting standards
utilizing empirical data: (1) classification of outcome probabilities,
(2) computation of a validity coefficient, and (3) utility analysis.

i.  The  Basic  Situation 
Two  criterion  groups  are  selected  for  use  in  this  procedure,  one  group 
comprised  of  instructed  students  and  another  of  uninstructed  students. 
The  instructed  group  should,  according  to  Berk,  "consist  of  those  students 
who  have  received  'effective'  instruction  on  the  objective  to  be  assessed." 


Berk suggests that these groups should be approximately equal in size and
large enough to produce stable estimates of probabilities. Test items
measuring one objective are then administered to both groups and the
distribution of scores (putting both groups together) can be divided by a
cut-off score into two categories.

Combining the classifications of students by predictor (test score) and
criterion (instructed vs. uninstructed status) results in four categories
that can be represented in a 2 x 2 table, with relevant marginals:

1.  True Master (TM): An instructed student whose test score lies above
the cutting score (C).

2.  False Master (FM): A Type II misclassification error, where an
uninstructed student's test score lies above the cutting score (C).

3.  True Non-Master (TN): An uninstructed student whose test score lies
below the cutting score (C).

4.  False Non-Master (FN): A Type I misclassification error, where an
instructed student's test score lies below the cutting score (C).

* 

In tabular form, this can be presented as follows. Note how the marginals
are defined, because they are used in the formulations to follow.

                                 CRITERION MEASURE

                          Instructed (I)    Uninstructed (U)

PREDICTOR    Predicted    TM                FM (Type II)        PM = TM + FM
(test        Masters
score)
             Predicted    FN (Type I)       TN                  PN = FN + TN
             Non-Masters

                          Masters           Non-Masters
                          M = TM + FN       N = FM + TN


ii. Classification of Outcome Probabilities

In this procedure, identification of the optimal cutting score involves
an analysis of the two-way classification of outcome probabilities shown above.
This can be done algebraically by following the steps listed below, or
graphically, as illustrated in a subsequent section. The steps to follow are:

1.  Set  up  a  two-way  classification  of  the  frequency  distribution  for  each 
possible  cutting  score. 

2. Compute the probabilities of the 4 outcomes (for each cutting score)
by expressing the cell frequencies as proportions of the total sample.
For instance:

Prob (TM) = TM/(M+N)

Prob (FM) = FM/(M+N)

Prob (TN) = TN/(M+N)

Prob (FN) = FN/(M+N)

3. For each cutting score, add the probability of correct decisions:
Prob (TM) + Prob (TN), and the probability of incorrect decisions:
Prob (FN) + Prob (FM).

4. The optimal cutting score is the score that maximizes Prob (TM) +
Prob (TN) and minimizes Prob (FN) + Prob (FM). It is sufficient to
observe the score that maximizes Prob (TM) + Prob (TN) because [Prob
(FN) + Prob (FM)] = 1 - [Prob (TM) + Prob (TN)]. That is, the score
that maximizes the probability of correct decisions automatically minimizes
the probability of incorrect decisions.
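
The four steps above can be sketched in a few lines of code. This is a
minimal illustration, not Berk's own implementation. It uses the frequency
distribution reported with Figure 6.7.1, and it assumes (a convention of this
sketch, not stated in the text) that a student scoring at or above the
cutting score is a predicted master. The function names are hypothetical.

```python
# Frequencies by test score 0..8, from the Figure 6.7.1 frequency table.
INSTRUCTED = [0, 2, 3, 5, 15, 20, 18, 10, 7]    # I group, 80 examinees
UNINSTRUCTED = [4, 9, 13, 18, 11, 8, 5, 2, 0]   # U group, 70 examinees


def outcome_probabilities(cut, instructed=INSTRUCTED, uninstructed=UNINSTRUCTED):
    """Return Prob(TM), Prob(FM), Prob(TN), Prob(FN) for one cutting score.

    Assumption of this sketch: score >= cut means "predicted master".
    """
    total = sum(instructed) + sum(uninstructed)   # M + N
    tm = sum(instructed[cut:])      # instructed, at or above the cut
    fn = sum(instructed[:cut])      # instructed, below the cut
    fm = sum(uninstructed[cut:])    # uninstructed, at or above the cut
    tn = sum(uninstructed[:cut])    # uninstructed, below the cut
    return {"TM": tm / total, "FM": fm / total,
            "TN": tn / total, "FN": fn / total}


def best_cut_by_correct_decisions():
    """Cutting score (1..8) that maximizes Prob(TM) + Prob(TN)."""
    def p_correct(cut):
        p = outcome_probabilities(cut)
        return p["TM"] + p["TN"]
    return max(range(1, 9), key=p_correct)
```

With these data the proportion of correct decisions is largest at a cutting
score of 4, where 114 of the 150 examinees (.76) are correctly classified.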




iii. Graphical Solution
Berk (1976) also mentions that the optimal cutting point for a
criterion-referenced test can be located by observing the frequency
distributions for the instructed and uninstructed groups. According to
Berk:

The  instructed  and  uninstructed  group  score 
distributions  are  the  primary  determinants 
of  the  extent  to  which  a  test  can  accurately 
classify  students  as  true  masters  and  true 
non-masters  of  an  objective.    The  degree  of 
accuracy  is,  for  the  most  part,  a  function  of 
the amount of overlap between the distributions.

If the test distributions overlap completely, no decisions can be made. The
ideal situation would be one in which the two distributions have no
overlap at all. A typical situation we should hope for is for the
instructed group distribution to have a negative skew, the uninstructed
group to have a positive skew, and for there to be a minimum of overlap.
The point at which the distributions intersect is then the optimal cut-off
score.

In Figure 6.7.1, the distributions of test scores for two groups
of examinees (one instructed group and one uninstructed group) are
shown.




Figure 6.7.1 Frequency polygons of criterion-referenced test
scores for two groups - an instructed group and an
uninstructed group on the content measured by the test.

[Figure: frequency (vertical axis) plotted against test score (horizontal
axis) for the Uninstructed Group (N=70) and the Instructed Group (N=80).
Regions A, B, C, and D of the plot mark the four types of examinees.]

Four Types of Examinees

A: Non-Masters, Correctly Classified
B: Masters, Incorrectly Classified
C: Masters, Correctly Classified
D: Non-Masters, Incorrectly Classified

Frequency Distribution of Test Scores

Test Score     U Group     I Group

    8             0           7
    7             2          10
    6             5          18
    5             8          20
    4            11          15
    3            18           5
    2            13           3
    1             9           2
    0             4           0
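
The graphical solution can be imitated numerically by finding where the two
frequency polygons cross. The sketch below is only an illustration of that
idea. It uses the frequencies from the table above and assumes straight
lines between adjacent score points, as in a frequency polygon. The function
name is hypothetical.

```python
SCORES = list(range(9))                          # test scores 0..8
INSTRUCTED = [0, 2, 3, 5, 15, 20, 18, 10, 7]     # I group frequencies
UNINSTRUCTED = [4, 9, 13, 18, 11, 8, 5, 2, 0]    # U group frequencies


def polygon_crossing(scores, instructed, uninstructed):
    """Score at which the instructed polygon first rises above the
    uninstructed polygon, by linear interpolation (None if no crossing)."""
    for k in range(len(scores) - 1):
        d0 = instructed[k] - uninstructed[k]
        d1 = instructed[k + 1] - uninstructed[k + 1]
        if d0 < 0 <= d1:
            fraction = -d0 / (d1 - d0)   # position of the crossing in the interval
            return scores[k] + fraction * (scores[k + 1] - scores[k])
    return None
```

For these data the polygons cross between scores 3 and 4 (at about 3.76), so
examinees scoring 4 or higher would be treated as masters.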

iv. Validity Coefficient

In this procedure, a validity coefficient is computed for each possible
cutting score. The cutting score yielding the highest validity coefficient
also yields the highest probability of correct decisions. To utilize the
procedure, the following steps should be followed:

1. From the two-way classification introduced earlier, compute the
base rate (BR) and the selection ratio (SR). They are given by:

BR = Prob (FN) + Prob (TM)
SR = Prob (TM) + Prob (FM)

2. Calculate the phi coefficient $\phi_{vc}$ using the following formula:

$$\phi_{vc} = \frac{\mathrm{Prob}(TM) - BR \cdot SR}{\sqrt{BR\,(1 - BR)\,SR\,(1 - SR)}}$$

3. The cutting score yielding the highest $\phi_{vc}$ is the optimal cutting
score.

The formula for the phi coefficient, $\phi_{vc}$, given above is suitable for a
2 x 2 table of cell probabilities. More generally, the phi coefficient is
the Pearson product-moment correlation between two dichotomous variables,
and could be arrived at as follows:

1. Each student with a test score above the cutting score in question
is assigned a 1, below a 0.

2. Each student in the instructed group is assigned a 1, in the
uninstructed group, a 0.

3. $\phi_{vc}$ would then be the correlation coefficient computed in the usual
way.
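
Both routes to the phi coefficient can be checked against each other. The
sketch below is illustrative only. It uses the Figure 6.7.1 frequencies, it
treats a score at or above the cut as "above the cutting score" (an
assumption of this sketch), and it uses the usual form of phi for a 2 x 2
table of proportions, in which Prob (TM) - BR(SR) is divided by the square
root of BR(1 - BR)SR(1 - SR). Function names are hypothetical.

```python
import math

# Frequencies by test score 0..8 (Figure 6.7.1 frequency table).
INSTRUCTED = [0, 2, 3, 5, 15, 20, 18, 10, 7]
UNINSTRUCTED = [4, 9, 13, 18, 11, 8, 5, 2, 0]


def phi_from_probabilities(cut):
    """Phi from the cell probabilities, base rate and selection ratio."""
    total = sum(INSTRUCTED) + sum(UNINSTRUCTED)
    p_tm = sum(INSTRUCTED[cut:]) / total
    p_fn = sum(INSTRUCTED[:cut]) / total
    p_fm = sum(UNINSTRUCTED[cut:]) / total
    br = p_fn + p_tm      # base rate
    sr = p_tm + p_fm      # selection ratio
    return (p_tm - br * sr) / math.sqrt(br * (1 - br) * sr * (1 - sr))


def phi_from_indicators(cut):
    """Pearson correlation between the two 0/1 codings described above."""
    above, instructed = [], []
    for score, (f_i, f_u) in enumerate(zip(INSTRUCTED, UNINSTRUCTED)):
        flag = 1 if score >= cut else 0
        above += [flag] * (f_i + f_u)
        instructed += [1] * f_i + [0] * f_u
    n = len(above)
    mean_a = sum(above) / n
    mean_i = sum(instructed) / n
    cov = sum((a - mean_a) * (i - mean_i) for a, i in zip(above, instructed)) / n
    var_a = sum((a - mean_a) ** 2 for a in above) / n
    var_i = sum((i - mean_i) ** 2 for i in instructed) / n
    return cov / math.sqrt(var_a * var_i)
```

For cutting scores 1 through 8 the two functions agree, and phi is highest
(about .52) at a cutting score of 4, the same score selected by the outcome
probabilities approach.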


v. Utility Analysis

In this section, costs or losses are assigned to the misclassification
of students as false masters or false non-masters. The procedures here are
closely tied to the decision-theoretic procedures discussed in a later section.
The procedure is presented at this point because it can be related to the
two Berk procedures just discussed.

First of all, Berk notes the following fact:

When the outcome probabilities or validity coefficient approach
is used to select the optimal cutting score, it is assumed
that the 2 types of errors are equally serious. If, however,
this assumption is not realistic in terms of the losses which
may result from a particular decision, the error probabilities
need to be weighted to reflect the magnitude of the losses
associated with the decision.

Berk notes that determination of the relative size of each loss is judgmental,
and must be guided by the consequences of the decision considered. He
mentions considering the following factors: student motivation, teacher time,
availability of instructional materials, content, and others. Berk suggests
the following, which we have capsulized into a series of steps:

1. Estimate the expected disutility of a decision strategy (C) by

$C_k = \mathrm{Prob}(FN)\,[D_1] + \mathrm{Prob}(FM)\,[D_2]$

where $D_1$ and $D_2 < 0$
and k = the single decision in question
$D_1$ and $D_2$ = respective disutility values

2. Estimate the expected utility of a decision strategy (v) by

$v_k = \mathrm{Prob}(TM)\,[U_1] + \mathrm{Prob}(TN)\,[U_2]$

where $U_1$ and $U_2 > 0$
and k = the single decision in question (same as for disutility)
$U_1$ and $U_2$ = respective utility values

3. Form a composite measure of test usefulness by combining the
estimates of utility and disutility across all decisions

$$Y = \sum_{k=1}^{n} (v_k + C_k)$$

Y = index of expected maximal utility.

4. Choose the cutting score with the highest Y index (it maximizes
the usefulness of the test for decisions with a specific set of
utilities and disutilities).
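
The steps above can be sketched as follows. This is a hedged illustration
rather than Berk's procedure in full: it considers a single decision (n = 1
in the sum), reuses the Figure 6.7.1 frequencies, treats a score at or above
the cut as predicted mastery, and uses made-up utility and disutility values.
All names and weights are hypothetical.

```python
# Frequencies by test score 0..8 (Figure 6.7.1 frequency table).
INSTRUCTED = [0, 2, 3, 5, 15, 20, 18, 10, 7]
UNINSTRUCTED = [4, 9, 13, 18, 11, 8, 5, 2, 0]


def y_index(cut, u1=1.0, u2=1.0, d1=-1.0, d2=-1.0):
    """Expected utility plus expected disutility for one cutting score.

    u1, u2 weight true masters and true non-masters (positive values);
    d1, d2 weight false non-masters and false masters (negative values).
    """
    total = sum(INSTRUCTED) + sum(UNINSTRUCTED)
    p_tm = sum(INSTRUCTED[cut:]) / total
    p_fn = sum(INSTRUCTED[:cut]) / total
    p_fm = sum(UNINSTRUCTED[cut:]) / total
    p_tn = sum(UNINSTRUCTED[:cut]) / total
    utility = p_tm * u1 + p_tn * u2        # v_k
    disutility = p_fn * d1 + p_fm * d2     # C_k
    return utility + disutility


def best_cut(**weights):
    """Cutting score (1..8) with the highest Y index for the given weights."""
    return max(range(1, 9), key=lambda cut: y_index(cut, **weights))
```

With equal weights the selected cutting score is 4, matching the earlier
approaches. If false masters are treated as twice as serious
(`best_cut(d2=-2.0)`), the selected cutting score rises to 5.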

vi.  Suggestions 

The procedures developed by Berk (1976) hold considerable promise
for use in setting criterion-referenced test score standards. The ideas
in his procedures are not new; there are other procedures that are
concerned with the maximization of correct decisions and the minimization of
false-positive and false-negative errors. The attractive feature is the
ease with which Berk's methods can be understood and applied. The major
potential drawback is in the assignment of examinees to criterion groups.
If many examinees in the "instructed group" do not possess the assumed
knowledge and skills measured by the criterion-referenced test (or if
many examinees in the "uninstructed group" do), Berk's methods will
produce inaccurate results.

6.7.2 Decision-Theoretic Procedures

Berk (1976) looked at the minimization of false-positive and
false-negative decisions through the use of actual test data. He selects as
optimal the cutting score that minimizes false-positive and false-negative
errors. Another way to look at false-positive and false-negative errors
is to assume an underlying distributional form for your data and then
observe the consequences of setting values, such as cutting points, based
upon the distributional model. The logic is the same here in terms of
minimization of errors, except that by assuming a distributional form,
actual data do not have to be collected. Situations can be simulated or
developed, based upon the model.

Meskauskas (1976) has related and compared these procedures to those
based upon analyses of the content of the test. In reference to these
models, of which we will describe one:

... the models to follow deal with approaches that
start by assuming a standard of performance and then
evaluating the classification errors resulting from
its use. If the error rate is inappropriate, the
decision maker adjusts the standard a bit and tries
his equation again.

Before discussing one of the procedures in greater detail, the Kriewall
binomial-based model, the procedures discussed here should be related to
criterion-referenced testing procedures involving the determination of test
length. Many of the test length determination procedures (Millman, 1973;
Novick & Lewis, 1974) make underlying distributional assumptions and proceed
in the fashion discussed above by Meskauskas. The focus of concern, however,
is test length determination, and not the setting of a cutting score. In
fact, Millman's (1973) procedure is based upon exactly the same underlying
distribution, the binomial, as is Kriewall's model to be discussed. It
should be pointed out that the procedures are exactly the same; the data
are just represented differently because of the level of concern, either
cutting score or test length.
i. Kriewall's Model
Kriewall's  (1972)  model  focuses  on  categorization  of  learners  into 
several  categories:    Non-master,  master,  and  an  in-between  state  where 
the  student  has  developed  some  skills,  but  not  enough  to  be  considered 
a  master. 

Kriewall assumes the function of measurement, using the test, is
to classify students into one of two categories, master or non-master.
Of course, the test, as a sample of the domain of tasks, is going to
misclassify some individuals as false-positives (masters based on the test,
but non-masters in reality) and false-negatives (non-masters on the test,
but masters in reality). By assuming a particular distribution, these
errors may be studied.

Kriewall's probability model, used to develop the likelihood of
classification errors, is based upon the binomial distribution. He assumes:

1. The test represents a randomly selected set of dichotomously scored
(0-1) items from the domain.

2. The likelihood of correct response for a given individual is a
fixed quantity for all items measuring a given objective.

3. Responses to questions by an individual are independent. That
is, the outcome of one trial (taking one question) is independent
of the outcome of any other trial.

4. Any distribution of difficulty of questions (for an individual)
within a test is assumed to be a function of randomly occurring
erroneous responses (Meskauskas, 1976).


With these assumptions, Kriewall views a student's test performance
as "a sequence of independent Bernoulli trials, each having the same
probability of success." A sequence of Bernoulli trials follows a binomial
distribution, which has a probability function which relates the probability
of occurrence of an event (a particular test score) to the number of questions
in the test by:

$$f(x) = \binom{n}{x} p^{x} q^{n-x}$$

where

x = a test score
n = total number of test items
p = examinee domain score
q = 1 - p

and

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$


Kriewall sets some boundary values and then studies the probability of
misclassification errors. Using the notation of Meskauskas (1976), set:

$\zeta_1$ = the lower bound of the mastery range (as a proportion of errors)
$\zeta_2$ = the upper bound of the non-mastery range
C = the cutting score; the maximal number of allowable errors for
masters. Kriewall recommends $C = (\zeta_1 + \zeta_2)/2$.

Given values for the above three variables, Kriewall uses the (assumed)
binomial distribution to determine probabilities. If α is the probability
of a false positive result (a non-master who scores in the mastery category)
and β is the probability of a false negative result (a master who scores in
the non-mastery category), then α and β are given by:

[The two formulas for α and β, binomial sums over w with limits set by C,
are not legible in the source.]

where w = observed number of errors (and x = n - w) for an individual.
According to Meskauskas (1976) the formula for α is:

[...] obtaining the probability that, given [...]
of equivalent trials, a person whose true [...]
to the lowest score in the mastery range
will fall in the non-mastery range.
By setting $\zeta_1$ and $\zeta_2$ at various values, with $C = (\zeta_1 + \zeta_2)/2$,
the probabilities of false positive and false negative errors can be studied.
The optimal value for C (and thus $\zeta_1$ and $\zeta_2$) would then be the value that
minimized α and β. The results are dependent, however, on n and w.
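
The kind of calculation described here can be sketched with the binomial
probability function. The following is one reading of the model, offered
under stated assumptions rather than as Kriewall's or Meskauskas's exact
formulas: the number of errors w on an n-item test is binomial; a borderline
non-master has error proportion `err_nonmaster`; a borderline master has the
smaller error proportion `err_master`; and an examinee is classified as a
master when w does not exceed `max_errors`. All parameter names are
hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of exactly k events in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def misclassification_probs(n, max_errors, err_master, err_nonmaster):
    """Return (alpha, beta) under the assumptions stated above.

    alpha: a borderline non-master makes max_errors or fewer errors and is
           classified as a master (false positive).
    beta:  a borderline master makes more than max_errors errors and is
           classified as a non-master (false negative).
    """
    alpha = sum(binom_pmf(w, n, err_nonmaster) for w in range(0, max_errors + 1))
    beta = sum(binom_pmf(w, n, err_master) for w in range(max_errors + 1, n + 1))
    return alpha, beta
```

For example, with a 10-item test, at most 3 errors allowed, and borderline
error proportions of .20 (master) and .40 (non-master), the sketch gives
α of about .38 and β of about .12. Doubling the test to 20 items with at
most 6 errors allowed lowers both.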

ii. Suggestions
While Kriewall has offered a method of studying classification
errors that does not depend upon actual data, we prefer the
method of Berk, due to its simplicity. Kriewall's model seems to
us to fit in much better with the procedures on test length
determination. For instance, suppose you have specified the minimal
values for α and β, and have determined C, the cutting point. Then
the formulas above for α and β can be solved for n, the total number
of questions needed. (It would be much easier if one isolated n on
the left hand side). This is exactly what is done when using the
binomial model to solve the test length problem.
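
Because n cannot easily be isolated algebraically, one practical route is a
direct search over test lengths. The sketch below is an illustration under
assumptions of its own, not a published procedure: errors are binomial, the
cutting point is given as an allowable percentage of errors (`c_percent`),
and the borderline error proportions and error limits are hypothetical
inputs.

```python
from math import comb


def tail_probs(n, c_percent, err_master, err_nonmaster):
    """False-positive and false-negative probabilities for an n-item test."""
    max_errors = (c_percent * n) // 100      # allowable number of errors
    def pmf(k, p):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)
    alpha = sum(pmf(w, err_nonmaster) for w in range(0, max_errors + 1))
    beta = sum(pmf(w, err_master) for w in range(max_errors + 1, n + 1))
    return alpha, beta


def shortest_test(c_percent, err_master, err_nonmaster,
                  alpha_max, beta_max, n_limit=500):
    """Smallest n (up to n_limit) meeting both error limits, else None."""
    for n in range(1, n_limit + 1):
        alpha, beta = tail_probs(n, c_percent, err_master, err_nonmaster)
        if alpha <= alpha_max and beta <= beta_max:
            return n
    return None
```

A call such as `shortest_test(30, 0.2, 0.4, 0.10, 0.10)` returns the
shortest test for which both error probabilities are at or below .10 under
these assumptions.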


In sum, we prefer the Berk method for observing probabilities
of misclassification errors both because of its simplicity and because
of the lack of restricting underlying distributional assumptions.
Kriewall's method does, however, offer a viable alternative for
setting a cut-off score when actual test data cannot be collected.

6.7.3 Empirical Models Depending
Upon a Criterion Measure

The models to be discussed in this section bear great resemblance
to both Berk's and Kriewall's methods just discussed. They have
been separated from those two methods because these methods are built
upon the existence of an outside criterion measure, performance
measure, or true ability distribution. The test itself, and the
possible cut-off scores, are observed in relationship with this outside
measure. The optimal cut-off is then chosen in reference to
the criterion measure. For instance, Livingston's (1975) utility-based
approach leads to the selection of a cut-off score that optimizes
a particular utility function. The procedure of Van der Linden and
Mellenbergh (1977), in contrast, leads to the selection of a cut-off
score that minimizes expected loss.

In reference to the setting of performance standards based upon
benefit (and cost), Millman (1973) has suggested that psychological
and financial costs be considered:

All things being equal, a low passing score
should be used when the psychological and
financial costs associated with a remedial
instructional program are relatively high.
That is, there should be fewer failings when
the costs of failing are high. These "costs"
might include lower motivation and boredom,
damage to self-concept, and dollar and time
expenses of conducting a remedial instructional
program. A higher passing score can
be tolerated when the above costs are not too
great or when the negative effects of moving
a student too rapidly through a curriculum
(i.e., confusion, inefficient learning and
so forth) are seen as very important to avoid.

In sum, to utilize these procedures, a suitable outside criterion
measure must exist. Success and failure (or probability of success
and failure) are then defined on the criterion variable and the cut-off
chosen as the score on the test that maximizes (or minimizes) some
function of the criterion variable. The existence of such a criterion
variable has implications for the utilization of these methods for
setting cut-off scores on minimum competency tests.

i. Livingston's Utility-Based Approach

Livingston (1975) suggests the use of a set of linear or
semi-linear utility functions in viewing the effects of decision-making
accuracy based upon a particular performance standard or
cut-off score. That is, the functions relating benefit (and cost)
of a decision are related linearly to the cutting score in question.
Livingston's procedure is like Berk's procedure for utility
analysis discussed in 6.7.1 except that Livingston develops his
procedure based upon any suitable criterion measure (not
just instructed versus uninstructed), and also specifies the
relationship between utility (benefit or loss) and cutting scores as
linear. The relationship does not have to be linear; however, using
such a relationship simplifies matters somewhat. In such a situation
the cost (of a bad decision) is proportional to the size of the
errors made and the benefit (of a good decision) is proportional to
the size of the errors avoided.

ii. Van der Linden and Mellenbergh's
Approach

The developers of this procedure have prescribed a method for
setting cutting scores that is related both to Berk's procedure and
Livingston's. We will describe the procedure briefly and in the
process relate it to Berk's work. A test score is used to classify
examinees into two categories: accepted (scores above the cutting
score) and rejected (scores below). Also, a latent ability variable
is specified in advance and used to dichotomize the student
population: students above a particular point on the latent variable are
considered "suitable" and below "not suitable." The situation may
be represented as follows.


                                  Latent Variable

                          Not suitable        Suitable
                          Y < d               Y ≥ d

Decision    Accepted      "False +"           l11(Y)
            X ≥ C         l01(Y)

            Rejected      l00(Y)              "False -"
            X < C                             l10(Y)

where C = cutting score on the criterion-referenced test
      d = cutting score on the latent variable (0 < d ≤ 1),

and where $l_{ij}$ (i, j = 0, 1) is a function of Y and related in the
loss function:

$$L = \begin{cases}
l_{00}(Y) & \text{for } Y < d,\ X < C \\
l_{10}(Y) & \text{for } Y \ge d,\ X < C \\
l_{01}(Y) & \text{for } Y < d,\ X \ge C \\
l_{11}(Y) & \text{for } Y \ge d,\ X \ge C
\end{cases}$$

The authors then specify risk (the quantity to be minimized) as the
expected loss, and the cutting score that is optimal is the value of C that
minimizes the risk function (expected value of loss). They simplify
matters (as does Livingston) by specifying their loss function as linear.
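
The idea of choosing C to minimize expected loss can be illustrated with a
small sample-based sketch. It is not the authors' derivation. It assumes a
sample of (test score, latent value) pairs, zero loss for correct decisions,
and a loss for each wrong decision that grows linearly with the distance of
the latent value from d. The data, cost weights, and function names are all
hypothetical.

```python
def expected_loss(pairs, c, d, cost_fp=1.0, cost_fn=1.0):
    """Average loss when examinees with test score >= c are accepted."""
    total = 0.0
    for x, y in pairs:
        accepted = x >= c
        suitable = y >= d
        if accepted and not suitable:        # "False +"
            total += cost_fp * (d - y)
        elif suitable and not accepted:      # "False -"
            total += cost_fn * (y - d)
    return total / len(pairs)


def best_cut(pairs, d, candidates, **costs):
    """Candidate cutting score with the smallest average loss."""
    return min(candidates, key=lambda c: expected_loss(pairs, c, d, **costs))


# Hypothetical (test score, latent value) pairs.
SAMPLE = [(2, 0.20), (3, 0.30), (4, 0.35), (5, 0.48), (5, 0.55),
          (6, 0.60), (7, 0.50), (8, 0.80), (9, 0.90)]
```

With d = .50 and equal cost weights, the sketch selects a cutting score of
5 for this sample. Tripling the weight on false positives moves the selected
cutting score up to 6. The value of d still has to be supplied from outside
the procedure.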

In sum, while Van der Linden and Mellenbergh have provided a method
for setting a cut-off score on the test, they have offered little to help
in setting the cut-off on the latent variable. In a sense then, they have
only transferred the problem of setting a standard to a different measure.

iii. Livingston's Use of Stochastic
Approximation Techniques

Livingston (1976) has developed a procedure for setting cut-off
scores based upon stochastic approximation procedures. According to
Livingston, the problem involving cut-off scores can be phrased as
follows to fit stochastic procedures: "In general, the problem is
to determine what level of input (written test score) is necessary to
produce a given response (performance), when measurements of the
response are difficult or expensive." The procedure, according to
Livingston, is as follows:

1. Select a person; record his/her test score and measure
his/her performance.

2. If the person succeeds on the performance measure (if
his/her performance is above the minimum acceptable),
choose next a person with a somewhat lower test score.
If the person fails on the performance measure, choose
a person with a higher written test score.

3. Repeat step 2, choosing the third person on the basis of
the second person's measured performance.

Livingston offers two different procedures for choosing step size,
the up-and-down and the Robbins-Monro procedure, and a number of
procedures for estimating minimum passing scores consonant with each.
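
The up-and-down variant, with a fixed step size, can be sketched as below.
This is a simplified illustration and not Livingston's estimation procedure:
in practice each step means finding a real examinee near the target test
score and measuring his or her performance, whereas here a stand-in function
(`passes_performance`, hypothetical) plays that role, and the passing score
is estimated simply by averaging the last few score levels visited, which is
an assumption of this sketch.

```python
def up_and_down(start_score, step, passes_performance, n_people):
    """Return the sequence of test-score levels at which people are chosen."""
    levels = [start_score]
    for _ in range(n_people - 1):
        current = levels[-1]
        if passes_performance(current):
            levels.append(current - step)   # success: try a lower test score
        else:
            levels.append(current + step)   # failure: try a higher test score
    return levels


def estimate_cutoff(levels, last_k):
    """Average of the last_k levels visited (a simple, assumed estimator)."""
    tail = levels[-last_k:]
    return sum(tail) / len(tail)


# Stand-in for the performance measure: succeeds at test scores of 60 or more.
def passes_at_60(score):
    return score >= 60
```

Starting at a test score of 80 with a step of 5 and twelve examinees, the
levels fall to 60 and then alternate between 55 and 60, so the average of
the last six levels is 57.5, just below the point where performance changes
from failure to success.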

This procedure, like those discussed earlier in this section,
depends upon the existence of a cut-score established on another
variable, this time the performance measure, in order to establish
the passing score on the test. This then limits greatly the
applicability of the method. Livingston (personal communication, 1978)
has suggested that judgmental data on performance can be used,
rather than actual performance data, with the procedure, but this
has yet to be documented in any fashion. When documented, the
possibilities for use of the procedures will be greatly expanded.

iv. Huynh's Procedures
Huynh (1976) has advanced procedures for setting cut-off scores
that are predicated on the existence of a "referral task." This
referral task can be envisioned as an external criterion to which
competency can be related. For instance, Huynh (1976) states that
"Mastery in one unit of instruction may not be reasonably declared if
it cannot be assumed that the masters would have better chances of
success in the next unit of instruction." The next unit in this case
would be the referral task.

These procedures once again depend upon an outside criterion
variable to permit the estimation of a cut-score. In
this case, the user of the method is asked to establish the
probability of success of individuals on the referral task. Because
of the necessity of a criterion variable for operation, these
procedures suffer in generalizability. They are, for instance,
apparently not useful for minimum competency testing situations, where
a criterion variable, and associated probability of success, are
next to impossible to establish.

6.7.4 Educational Consequences

In this situation, one is concerned with looking at the effect setting
a standard of proficiency has on future learning or other related cognitive or
affective success criteria. According to Millman (1973), the question here is
"What passing score maximizes educational benefits?".

This  approach  can  be  visualized  from  an  experimental  design  point  of 
view.    A  subject  matter  domain  is  taught  to  a  class  of  students  who  are  then 
tested  on  the  material.    These  students  are  assigned  (randomly)  to  groups 
with  the  groups  differing  on  the  performance  level  required  for  passing  the 
test.    The  students  are  then  assessed  on  some  valued  outcome  measure  and  the 
level  of  performance  on  the  criterion-referenced  test  for  which  the  valued 
outcome is maximal (it could be a combination of valued outcomes) becomes
the performance standard or criterion score.

Thus, to use this method, much more data needs to be collected than
for the item content procedures. An experiment must be conducted, and
then a cut-off score is selected based upon the results of the experiment.
Because of the difficulties involved in designing and carrying out
experiments in school settings, the method is unlikely to find much
use.


i.  Block's  Study 

Block's study (1972) involves students learning a subject segment on matrix
algebra using a Mastery Learning paradigm. Such a paradigm dictates that
students who don't perform adequately on the posttest be recycled through
remedial activities until they demonstrate mastery (i.e., attain a score above
the cutting score). Block established four groups of students, where each
group was tested using one of the following four performance standards: 65,
75, 85, and 95% of the material in a unit must be mastered before proceeding
on to the next unit. He then examined the effects of varying the performance
standard on six criteria that were used as the variables to be maximized.

Viewing these criteria as either cognitive or affective, Block observed
that the 95% performance level maximized student performance on the cognitive
criteria, while the 85% performance level seemed to maximize the affective
criteria.

Some comments on Block's study are in order. One, the results lack
generalizability. The 95% and 85% levels, which maximize the cognitive and
affective measures respectively, are likely to change with the subject matter.
Two, as pointed out by Glass (1978a), the method of
maximizing a valued outcome assumes that there is a distinct point or
criterion score on the CRT that maximizes the outcome. What if the curve
relating performance on the CRT to the valued outcome is monotonically
increasing, so that 100% performance on the CRT maximizes the valued outcome?
In fact, it is more likely to be the case that the graph is monotonically
increasing than the case where the graph increases and decreases. For example:
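1. Monotonically increasing graph (Problem situation)

[Graph: valued outcome (vertical axis) against performance on the CRT
(horizontal axis, up to 100%).]

2. Ideal situation

[Graph: valued outcome (vertical axis) against performance on the CRT
(horizontal axis, with 0%, 70%, and 100% marked).]

(Reproduced from Glass, 1978a, permission for reproduction pending.)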

Thus, it can be seen that unless the graph increases and then decreases,
a 100% performance standard will be optimal. This standard is of limited use
because it is not realistic to expect all students to attain that level.


Third, Block discusses that if there are multiple criteria to be
maximized as valued outcomes, then some model for combining criteria with
relevant weights needs to be developed. He does not offer any procedures for
doing so, however, and he looks at the effects of the performance standards
on each of the 6 criteria separately. It should be noted that multiple
criteria is a way around the problem discussed above (Glass, 1978a). For
instance, if one of the outcomes has a monotonically increasing relationship
with the test scores and the other a monotonically decreasing relationship,
then the composite can have a peak value at a point other than 0% or 100%.
While this would seem to solve the problem, another problem is only further
exacerbated: what weights should be assigned to the valued outcomes to
form the composite? These procedures have not yet been developed, and
further, they are likely to be situation specific.
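
The dependence of the composite's peak on the weights can be seen in a small
numerical sketch. Everything in it is hypothetical: the two outcome curves
(one rising, one falling with the performance standard) are invented for
illustration, as are the weights and the function names.

```python
import math


def composite(standard_pct, w_increasing, w_decreasing):
    """Weighted sum of two made-up valued outcomes at a given standard (0-100%)."""
    s = standard_pct / 100
    increasing = math.sqrt(s)     # hypothetical outcome that rises with the standard
    decreasing = 1 - s ** 2       # hypothetical outcome that falls with the standard
    return w_increasing * increasing + w_decreasing * decreasing


def best_standard(w_increasing, w_decreasing, grid=range(0, 101, 5)):
    """Standard on the grid at which the composite is largest."""
    return max(grid, key=lambda pct: composite(pct, w_increasing, w_decreasing))
```

With equal weights the composite peaks at a 40% standard on this grid; with
the rising outcome weighted twice as heavily the peak moves to 65%. An
interior peak is not guaranteed in general: if both curves were straight
lines, the composite would also be a straight line and would peak at 0% or
100%.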




6.8    Combination  Methods 


6.8.1 Judgmental-Empirical

Zieky and Livingston (1977), and more recently, Popham (1978), have
suggested two procedures that are based upon a combination of judgmental
and empirical data. In addition, both Zieky and Livingston and Popham have
included an in-depth discussion of how to implement the procedures,
something that has been lacking with many other procedures. The
two procedures presented by Zieky and Livingston, the Borderline-Group
and Contrasting-Groups methods, are procedurally similar.
They differ in the sample of students on which performance data is
collected. Further, while judgments are required, the judgments
necessary are on students, not on items, as are many of the other
judgmental methods (Nedelsky, Angoff, Ebel, etc.). Zieky and
Livingston make the case that judging individuals is likely to be a
more familiar task than judging items. Teachers are the logical
choice as judges, and for them, the assessment of individuals is
commonplace.

i.  Borderline-Group Method
This method requires that judges first
define what they would envision as minimally acceptable performance
on the content area being assessed. The judges are then asked to
submit a list of students (about 100 students) whose performances
are so close to the borderline between acceptable and unacceptable
that they can't be classified into either group. The test is
then administered to this group, and the median test score for the
group is taken as the standard.
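
The computation in the Borderline-Group Method is simply a median. A minimal
sketch follows; the function name and the scores are hypothetical and serve
only to illustrate the step described above.

```python
from statistics import median

def borderline_group_standard(borderline_scores):
    """Return the median test score of the judged borderline students."""
    if not borderline_scores:
        raise ValueError("at least one borderline student score is required")
    return median(borderline_scores)

# Hypothetical test scores for a (very small) borderline group.
example_scores = [12, 15, 14, 13, 16, 14, 15]
example_standard = borderline_group_standard(example_scores)
print(example_standard)
```

In practice the group would be much larger (the text suggests about 100
students) so that the median is reasonably stable.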

ii.  Contrasting-Groups Method
Once judges have defined minimally acceptable performance for
the subject area being assessed, the judges are asked to identify those
students they are sure are either definite masters or non-masters
of the skills measured by the test. Zieky and Livingston suggest
100 students in the smaller group in order to assure stable results.
The test score distributions for the two groups are then plotted and
the point of intersection is taken as the initial standard. This is
exactly the same as the graphical procedure suggested by Berk, and
presented in section 6.7.1. Zieky and Livingston then suggest
adjusting the standard up or down to reduce "false masters" (students
identified as masters by the test, but who have not adequately mastered
the objectives) or "false non-masters" (students identified as non-
masters by the test, but who have adequately mastered the objectives).
The direction to move the cut-off score depends on the relative
seriousness of the two types of errors.
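
The adjustment step can be sketched numerically. The code below is not the
graphical procedure itself: as a stand-in for reading the intersection off a
plot, it counts the two kinds of errors at every candidate cut-off and picks
the cut-off with the smallest weighted total. Integer test scores, the error
weights, and all data are assumptions made for illustration.

```python
def classification_errors(cutoff, master_scores, nonmaster_scores):
    """Count (false masters, false non-masters) when passing means score >= cutoff."""
    false_masters = sum(1 for s in nonmaster_scores if s >= cutoff)
    false_nonmasters = sum(1 for s in master_scores if s < cutoff)
    return false_masters, false_nonmasters

def contrasting_groups_cutoff(master_scores, nonmaster_scores,
                              fm_weight=1.0, fnm_weight=1.0):
    """Return the integer cut-off with the smallest weighted error count.

    Ties are resolved in favor of the lowest cut-off.
    """
    all_scores = list(master_scores) + list(nonmaster_scores)
    candidates = range(min(all_scores), max(all_scores) + 2)

    def weighted_errors(cutoff):
        fm, fnm = classification_errors(cutoff, master_scores, nonmaster_scores)
        return fm_weight * fm + fnm_weight * fnm

    return min(candidates, key=weighted_errors)

# Hypothetical scores for students judged to be sure masters and sure non-masters.
masters = [12, 14, 15, 16, 16, 17, 18, 19, 20]
nonmasters = [7, 8, 9, 10, 11, 12, 13, 16]

initial = contrasting_groups_cutoff(masters, nonmasters)
stricter = contrasting_groups_cutoff(masters, nonmasters, fm_weight=5.0)
more_lenient = contrasting_groups_cutoff(masters, nonmasters, fnm_weight=5.0)
print(initial, stricter, more_lenient)
```

Weighting false masters more heavily moves the cut-off up; weighting false
non-masters more heavily moves it down, which mirrors the direction of
adjustment described in the text.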

iii.  Suggestions
These methods, particularly the Contrasting-Groups Method, are
very similar to the procedure suggested by Berk. Instead of actually
forming instructed and uninstructed groups, however, as suggested by
Berk, the Contrasting-Groups Method asks judges to form the groups.
This judgmental procedure would seem more advantageous when the content
being assessed has had a long instructional period (minimum competency
testing is an example), or when there would be problems justifying
the existence of an uninstructed group. Berk's method would be more
useful for tests based on short instructional segments, most likely
administered at the classroom level.

A comparison of the judgments involved in the two procedures
indicates that the Contrasting-Groups Method would be the easier
method to justify using. It is a more reasonable task for teachers
to identify "sure" masters and non-masters than it is for them to
identify borderline students in the subject area being assessed. In
sum, the Contrasting-Groups Method appears to us to be a most
reasonable way of setting standards.

6.8.2  Bayesian  Procedures 

Novick and Lewis (1974) were the first to suggest that Bayesian
procedures are useful for setting standards. Schoon, Gullion, and
Ferrara (1978) have more recently discussed Bayesian procedures for
setting standards. According to Schoon et al., Bayesian procedures
allow the incorporation of:

1.  a loss ratio, reflecting the severity of false-positive
and false-negative decision errors,

2.  prior information on the distribution of domain scores in
the population of interest,

3.  current information on an examinee's domain score, and

4.  the degree of certainty that an examinee's domain score
exceeds the cut-off score.

Of course, a cut-off score must first be set in order for the four
factors to be incorporated. Thus, Bayesian procedures offer a way
of augmenting the establishment of a cut-off score rather than a
method for setting the cut-off score itself.



In sum, Bayesian procedures present a method for augmenting
the setting of a cut-off score by utilizing available prior and
collateral information. The procedure also provides a posterior
statement of degree of certainty about a candidate's performance.
Bayesian procedures do not, however, offer a method for setting a
cut-off score in the first place. Bayesian procedures have been included
in this review because they do offer a method for combining judgmental
and empirical data to arrive at a revised standard.
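
How the four ingredients fit together can be sketched with a simple
beta-binomial model. This model is an assumption chosen for illustration; it
is not claimed to be the formulation used by Novick and Lewis or by Schoon et
al. The prior parameters, the cut-off, and the loss ratio below are all
hypothetical. The sketch computes the posterior probability that an
examinee's domain score exceeds an already chosen cut-off, and then compares
that probability with the threshold implied by the loss ratio.

```python
import math

def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p, for 0 < p < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def prob_domain_score_exceeds(cutoff, correct, n_items,
                              prior_a=1.0, prior_b=1.0, steps=2000):
    """Posterior P(domain score >= cutoff) by midpoint-rule integration.

    The Beta(prior_a, prior_b) prior stands for prior information on the
    distribution of domain scores; correct/n_items is the current test data.
    """
    a = prior_a + correct
    b = prior_b + n_items - correct
    width = (1.0 - cutoff) / steps
    total = 0.0
    for i in range(steps):
        p = cutoff + (i + 0.5) * width   # midpoints stay strictly inside (0, 1)
        total += beta_pdf(p, a, b)
    return total * width

def classify(prob_master, loss_ratio):
    """Minimize expected loss; loss_ratio = loss(false positive) / loss(false negative)."""
    threshold = loss_ratio / (1.0 + loss_ratio)
    return "master" if prob_master >= threshold else "non-master"

# Hypothetical case: uniform prior, 9 of 10 items correct, cut-off of 80%.
certainty = prob_domain_score_exceeds(0.80, correct=9, n_items=10)
print(round(certainty, 3), classify(certainty, loss_ratio=2.0))
```

Note that the cut-off of 80% is an input to the sketch, not an output, which
is the point made above: the Bayesian machinery refines decisions around a
standard that has already been set by some other method.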



6.9    Some  Procedural  Steps  in  Standard  Setting 

In earlier sections of this unit, issues and many methods for standard
setting were discussed. In this section, procedures will be outlined
for setting standards on criterion-referenced tests used for three
different purposes. The purposes considered are:

1.  Classroom testing

2.  Basic skills testing for yearly promotion and high school
graduation

3.  Professional licensing and certification testing

Classroom testing is emphasized since classroom teachers have fewer
technical resources available to them than do the larger testing programs.
Our ultimate objective is to provide a comprehensive set of practical
guidelines for practitioners. At this time the guidelines are far from
comprehensive; much research is needed to supply information necessary
to construct thorough guidelines. We have suggested in places some of
the questions that need to be answered.

Certain things are assumed. First, it is assumed that in each case a set of
objectives or competencies has been agreed upon, and that they are
described via the use of domain specifications or some other equally
appropriate method. Second, it is assumed that no fixed selection ratio
exists (e.g., one might be fixed in effect by having resources to provide
only a certain number of students with remedial work), since if it does
there is no reason to set standards. Finally, we do not discuss the
important and interesting political issues of who participates in and
who controls the standard-setting process; we take as given that some
such process exists and only address the issue of participation from the
perspective of practicality.


6.9.1    Preliminary  Considerations 

Before any standard setting is undertaken for any purpose, an
analysis of the decision-making context and of the resources available
for the project should be done. The results of this analysis will
determine how extensive and sophisticated the standard-setting procedure
should be. Analysis of the decision-making context involves judging
the importance of the decisions that are to be made using the test,
the probable consequences of those decisions, and the costs of errors.
Others have discussed using these same considerations in adjusting the
final standard, but they may also be helpful in choosing a standard-
setting method. Formal procedures for using this information are
probably not necessary; a discussion of the issues by those directing
the project should suffice. Some issues to consider would include (1)
the number of people directly and indirectly affected by the decisions
to be based on the test; (2) possible educational, psychological,
financial, social and other consequences of the decisions; and (3)
the duration of the consequences.

The next step should be a consideration of the resources available
for the standard setting. Resources include money, materials, clock time,
personnel time and expertise. How much of the total amount of available
resources will be dedicated to the standard setting will depend upon the
results of the prior discussion of decision context. The final decision
as to the resources to be invested will determine how large and
technically sophisticated the standard-setting enterprise may be.

A great deal of information needs to be collected on the actual
expenditures of various resources that have been required to carry out
standard setting by different methods in different contexts. Actual time
and money data would be invaluable to practitioners in choosing a
method for their own situation. In the following discussion, procedural
steps in increasing order of expense and complexity will be offered, but
real data on these factors are lacking and are a pressing need.

6.9.2    Classroom Testing

The classroom teacher is most likely to use criterion-referenced
tests for diagnostic purposes, that is, for determining whether a student
has mastered an area or needs further work in it. This would seem to
be the most common situation calling for the setting of standards. Here
the teacher must decide what level of test performance constitutes
"mastery." In the same testing context the teacher may set additional
performance standards, above and/or below the minimal level, for the
awarding of grades on the material.

Typically the classroom teacher works alone, or at most with one
or more other teachers of the same grade. It is also quite often the
case that a classroom exam is used only once. In these situations methods
based only on judgment of test content may be the only ones practicable.
The methods developed by Ebel, Nedelsky and Angoff would be appropriate
here; the details of each of them have been discussed in an earlier
section, so we will not reiterate the procedural steps here.

When available resources permit involving more people in the standard
setting, parents and other community members might be enlisted, or a group
of teachers of one grade from an entire school district might collaborate
in setting standards. Again, if resources permit, data on group performance
on individual items may be tabulated and considered in setting the standards
on subsequent tests, or if tests are retained from year to year, the
performance data from the previous year might be used. Of course, this
can also be done by teachers working alone. The following is a list of
steps, some of which could be omitted if resources were limited, for
involving parents of students in a particular class in setting standards
for classroom tests over units of instruction. The method borrows
heavily from Jaeger (1978). (It is assumed that the objectives have
been identified and the teacher (or teachers) has prepared domain
specifications.)

1.  At the beginning of the school year, a letter is sent
to parents explaining the project and inviting them to
a meeting where more information will be given.

2.  At the meeting parents are given copies of domain
specifications for the first test, along with example items.
They are asked to indicate for each objective a percentage
of items which, if answered correctly, would demonstrate that the
student had mastered the material adequately. At this
meeting they should be encouraged to discuss the task
and ask any questions they might have about it.

Instructions accompanying the standard-setting task should indicate to
the parents how their judgment will be employed (for example, averaged
with the percentages indicated by every other parent, and the resulting
standard applied to every child in that class or grade). We have
suggested for reasons of test security that the parents base their judgment
on domain specifications rather than on actual test items; if test
forms from previous years are available and thought to be parallel to
the new exam, it may be easier for parents to make their judgments as
a percentage correct of items on the parallel test.

3.  The teacher constructs the criterion-referenced test
from the domain specifications before looking at the
parents' standards.

4.  Class performance data are tabulated after the test is
administered.




5.  Parent judgment for the second test (or set of tests)
is solicited by mail. The mailing packet includes
domain instructions (duplicating those given at the
earlier meeting), and performance data from the first
test (number of students achieving each set standard).

Instructions would also stress that judgments were to be based primarily
on domain specifications and only secondarily on performance data.

6.  Step 5 is repeated during the year whenever a competency-
type test is to be given.

Alternatively, this procedure might be reserved for those instructional
units judged to cover basic, required objectives for that grade; parents'
instructions would then identify the tested materials as such.

7.  The teacher keeps files for each test, including the
domain specifications, parent judgment forms, actual
exam and performance data.

8.  Periodic meetings can be held to review the instructions
and to discuss the procedure and its results.

Such discussions may lead to parents questioning the performance of students
and are likely to provoke queries into both the teacher's methods and his/her
subject matter. Teachers should be prepared for this; it may lead to
parents wanting greater involvement in determining other aspects of their
children's schooling, a desire one hopes can be creatively and
constructively used.

Other variants on this procedure can include appointing a small
committee of parents, possibly working with several teachers, instead
of an open parents group. A parent-objective (matrix) sampling strategy
could be employed to reduce the number of judgments required of each
parent.
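
The arithmetic behind the parent procedure can be sketched briefly. The code
below assumes the averaging option mentioned in the instructions to parents,
and it tabulates the performance data described in step 5 (the number of
students achieving a set standard). The objective names, percentages, and
function names are hypothetical.

```python
from statistics import mean

def standards_from_parent_judgments(judgments):
    """Average the parents' percentages separately for each objective.

    judgments maps an objective name to the list of percentages
    submitted by the parents for that objective.
    """
    return {objective: mean(percents) for objective, percents in judgments.items()}

def count_meeting_standard(student_percent_correct, standard):
    """Number of students whose percent correct is at or above the standard."""
    return sum(1 for pct in student_percent_correct if pct >= standard)

# Hypothetical judgments from three parents on one objective, two on another.
parent_judgments = {"fractions": [70, 80, 90], "decimals": [60, 80]}
standards = standards_from_parent_judgments(parent_judgments)

# Hypothetical class results on the "fractions" objective.
achieved = count_meeting_standard([65, 80, 95, 79], standards["fractions"])
print(standards, achieved)
```

Other pooling rules (a median, for instance) could be substituted for the
mean; whichever rule is used should be stated in the instructions so that
parents know how their judgments will be employed.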



Another procedure for setting standards with criterion-referenced
tests in instructional settings was offered by Hambleton (1978).
According to Hambleton, "[His] is not a 'validated list' of guidelines.
It is a list of practical guidelines I have evolved over the years
through my work with numerous school districts." His eleven-step list
of guidelines is as follows:

1.  The determination of cut-off scores should be done by several
groups working together. These groups include teachers, parents,
curriculum specialists, school administrators, and (if the tests
are at the high school level) students. The number from each
group will depend upon the importance of the tests under
consideration and the number of domain specifications. At a
minimum, I like to have enough individuals to form at least
two teams of reviewers. This way I can compare their results
on at least a few domain specifications to determine the
consistency of judgments in the two groups. When sufficient
time is available I prefer to obtain two independent judgments
of each cut-off score.

2.  I usually introduce either the Ebel method or the Nedelsky
method. Following training on one of the methods, I have the
groups work through several practice examples. Differences
between groups are discussed and problems are clarified.

3.  The domain specifications (or, more usually but less appropriately,
the objectives) are introduced and discussed with the judges.

4.  I try to set up a schedule so that roughly equal amounts of
time are allotted to a consideration of each domain
specification. If some domain specifications are more complex or
important, I usually assign them more time.

5.  I make sure that the judges are aware of how the tests will be
used and with what groups of students.

6.  If there exist any relationships among the domain specifications
(or objectives) the information is noted. For example, if a
particular objective is a prerequisite to several others it
may be desirable to set a higher cut-off score than might
otherwise be set.

7.  Whenever possible I try to have two or more groups determine
the cut-off scores. Consistency of their ratings can be studied,
and when necessary, differences can be studied, and a consensus
decision reached.




8.  If some past test performance data are available, they can be
used to make some modifications to the cut-off scores. On
some occasions, instead of modifying cut-off scores, decisions
can be made to spend more time in instruction to try to
improve test performance. If past group performance on an
objective is substantially better than the cut-off score, less
time may be allocated to teaching the particular objective.

9.  As test data become available, the percentage of "masters" and
"non-masters" on each objective should be studied. If
performance on some objectives appears to be "out of line,"
an explanation can be sought by a consideration of the test
items (perhaps the test items are invalid), the level of the
cut-off score, variation in test performance across classes,
a consideration of the amount of instructional time allotted
to the objective and so on.

10.  Whenever possible I try to compare the mastery status of
uninstructed and instructed groups of examinees. Instructed
groups ought to include mainly "master" students. The
uninstructed groups should include mainly the "non-masters." If
many students are being misclassified, a more valid cut-off
score can sometimes be obtained by moving it (for example,
see Berk, 1976).

11.  It is necessary to re-review cut-off scores occasionally.
Curriculum priorities change and so do instructional methods.
These shifts should be reflected in the cut-off scores that
are used.
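
The consistency check in guidelines 1 and 7 can be sketched as a simple
comparison of two teams' cut-off scores. The tolerance of 10 percentage
points, the function name, and the data are hypothetical; the guidelines do
not specify how large a difference should trigger discussion.

```python
def flag_inconsistent_cutoffs(team_a, team_b, tolerance=10):
    """Return {domain specification: gap} where two teams' cut-offs differ widely.

    team_a and team_b map domain specification names to cut-off scores
    (percent correct). Only specifications rated by both teams are compared.
    """
    flagged = {}
    for spec in sorted(set(team_a) & set(team_b)):
        gap = abs(team_a[spec] - team_b[spec])
        if gap > tolerance:
            flagged[spec] = gap
    return flagged

# Hypothetical cut-off scores set independently by two teams of reviewers.
team_one = {"A": 80, "B": 70, "C": 90}
team_two = {"A": 75, "B": 90}
needs_discussion = flag_inconsistent_cutoffs(team_one, team_two)
print(needs_discussion)
```

Flagged specifications would then go back to the judges for study and a
consensus decision, as guideline 7 describes.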

There are many important questions needing to be researched. These
techniques have apparently been used very little (there is certainly
much more literature on how to set standards than on what happens when
one does); we need to know the effects of involving different groups of
people in the standard setting (especially parents as opposed to others),
of the number of people involved, the information and instructions
provided and the frequency of standard setting. How do these factors affect
the levels set and the public acceptability of the chosen standard, and are
the procedures cost-effective?




6.9.3    Basic Skills Testing for Annual Promotion and High School Graduation

These are clearly areas where greater importance is attached to the
consequences of testing and, hence, more resources will be allocated than
for classroom testing. The discussion is limited here to testing of
"minimal" competencies, not intending that the procedures be applied
to the total curriculum. Further, we are not discussing the "life skill"
or "survival" competencies; in setting standards for these skills it is
necessary to consider performance on criterion measures of life success.
We feel that this undertaking is beyond the capabilities of educational
and measurement practice. It will be difficult enough to decide upon and
assess "minimal" skills. For these skills, since no external criterion
measures can be said to exist, the appropriate performance data to
consider in standard setting are scores on the actual tests (or items).
We agree with those (e.g., Jaeger, 1978; Linn, 1978; Shepard, 1976) who
hold that performance data should be considered along with test content
to inform the setting of standards. While from an idealistic point of
view it would be desirable to set standards with reference only to the
content of a domain, in reality the degree of skill in test construction
required for the pure-content approach is probably beyond human
attainment. In order to avoid unpleasant shocks it would seem good practice
to examine test performance data; the other benefit of so doing is that
feedback is received on our content-based judgments and may thus refine
our skills.

Jaeger (1978) has provided an excellent guide to implementing a
procedure involving representative groups affected by standards set for
high school graduation. The method was discussed earlier, but a brief
review at this point seems useful. In general terms, it is an iterative
procedure for soliciting item-by-item judgments from groups of judges.
Information fed back to the judges at each iteration includes (a) group
performance on each test item in a pilot administration, (b) the
percentage of students who would have passed given several different
standards, and (c) a distribution of the standards suggested by the judges in
the group. The median passing score for each type of judge is computed,
and the lowest of the medians is taken as the standard.
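
The final computation in this review (a median for each type of judge, then
the lowest median) can be sketched as follows. The iterative feedback rounds
are not modeled; the sketch covers only the last step, and the judge types
and passing scores are hypothetical.

```python
from statistics import median

def lowest_median_standard(passing_scores_by_judge_type):
    """Return (standard, medians) where standard is the lowest group median.

    passing_scores_by_judge_type maps each type of judge to the list of
    passing scores recommended by judges of that type in the final round.
    """
    medians = {judge_type: median(scores)
               for judge_type, scores in passing_scores_by_judge_type.items()}
    return min(medians.values()), medians

# Hypothetical final-round recommendations (number of items correct).
final_round = {
    "teachers": [30, 32, 35],
    "parents": [28, 30, 33, 36],
    "students": [25, 29, 34],
}
standard, group_medians = lowest_median_standard(final_round)
print(standard, group_medians)
```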

The principal attraction of plans such as Jaeger's and the one
outlined in Section 6.9.2, which is based on Jaeger's, is their political
viability. By involving a broad cross-section of constituents in the
setting of the standard, one increases the acceptability of that standard.
However, no actual control or very significant influence over the
educational process is transferred to the constituency; the objectives and the
test, after all, are presented to them as givens, and their contribution
in setting the standard is really quite limited. Moreover, the consensus
method, while probably not harmful, may not produce results that make any
pedagogical sense. Where obtaining popular support is not a critical
problem, educators may prefer to rely upon the judgments of subject-
matter and measurement "experts" to set standards. This may produce a
more coherent, if less universally accepted, result. Such a procedure
could be implemented as follows (the steps would be executed for each
subject matter area by content experts, working with measurement experts):

1.  Categorize the educational objectives or competencies
as being of the knowledge/information type or of the
rule-learning type (this distinction corresponds to
Meskauskas's (1976) continuum vs. state mastery models).

In the first case it makes sense to speak of a domain score, and to sample
randomly from the domain to estimate that score. In the second, since
learning is presumed to be all-or-none, sampling considerations are not
relevant, but construction of a few test items that accurately reflect
the ability is critically important. Objectives domains of the first
type reflect Ebel's (1978) notion of the purpose of competency
certification tests as being efficient and accurate indicators of the level
of achievement in a broad domain, rather than lists of specific
competencies attained.

2.  For objectives or competencies of the first type, construct
tests with the aid of domain specifications, items matched
to the domain specifications, and a suitable item sampling
plan.

3.  Ebel's standard-setting method (or one of the other content-
focused methods) may then be used to set the standard for
these parts of the test. To use Ebel's method the items
from all of the knowledge/information (or continuum) domains
would be considered together. (Table 6.9.3 provides a
comparison of six possible methods.)

4.  Pooling the judgments of all the experts may present a
problem. Simply averaging the ratings given to each item
(on relevance and difficulty) and/or the standards assigned
to each category will probably not give a very meaningful
result. Ideally, the experts will go through a series of
iterations in which they compare their independent judgments
(first of the item categorization and next of the standards
they assigned to each category), note discrepancies, discuss
the rationale for each judgment, possibly decide upon
revisions in the test (this will direct the procedure back to
Step 2, to ensure that any revisions do not distort the
test's domain representativeness), and/or persuade each
other to change their judgments. Unanimity might be
required in order to proceed from this step.

5.  For  those  objectives  or  competencies  classified  as  being  of 
the  "State"  variety,  smaller  sets  of  items  are  required 
since  the  domains  are  more  homogeneous,  but  item  construc- 
tion must  be,   if  anything,  more  painstaking.  Ideally, 
experimental  evidence  would  be  garnered  to  show  that  item 
performance  truly  reflected  the  target  construct. 


6.  Standards on these State-type objectives can be adjusted
back from 100% using Emrick's (1971) technique if the
probabilities of Type 1 and Type 2 classification errors
can be estimated. Similarly, domain scores can be adjusted
by a Bayesian procedure (e.g., Hambleton & Novick, 1973)
to compensate for relative losses associated with the
classification errors.
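
For step 3, the arithmetic of an Ebel-style standard can be sketched as
follows. The details of Ebel's method were presented in an earlier section;
the sketch assumes the usual form, in which items are sorted into
relevance-by-difficulty categories and a minimally competent examinee is
expected to answer a stated percentage of the items in each category
correctly. The categories, item counts, and percentages are hypothetical, and
the sketch represents a single agreed set of judgments rather than a pooling
of several experts.

```python
def ebel_standard(categories):
    """Return (expected items correct, passing percentage).

    categories is a list of (number_of_items, expected_percent_correct)
    pairs, one pair per relevance-by-difficulty category.
    """
    total_items = sum(n_items for n_items, _ in categories)
    if total_items == 0:
        raise ValueError("the test must contain at least one item")
    expected_correct = sum(n_items * pct / 100.0 for n_items, pct in categories)
    return expected_correct, 100.0 * expected_correct / total_items

# Hypothetical categorization of a 40-item test.
example_categories = [
    (10, 90),   # e.g., essential and easy items
    (20, 70),   # e.g., important items of medium difficulty
    (10, 50),   # e.g., acceptable but hard items
]
items_needed, percent_needed = ebel_standard(example_categories)
print(items_needed, percent_needed)
```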


Table 6.9.3

A Comparison of Several Standard Setting Methods

Questions:

1.  Is a definition of the minimally competent individual necessary?
2.  What is the nature of the rating task: items or individuals?
3.  Are examinee data needed?
4.  Do judges have access to the items?
5.  Are the judgments made in a group setting or individual setting?

Method               Type          Q1    Q2            Q3    Q4                           Q5
-------------------  ------------  ----  ------------  ----  ---------------------------  ----------
Nedelsky             Judgmental    Yes   Items         No    Yes                          Both
Modified Nedelsky    Judgmental    Yes   Items         No    Yes                          Both
Angoff               Judgmental    Yes   Items         No    Yes                          Both
Modified Angoff      Judgmental    Yes   Items         No    Yes                          Both
Ebel                 Judgmental    Yes   Items         No    Yes                          Both
Jaeger               Judgmental    No    Items         No    Yes                          Both
Contrasting Groups   Combination   No    Individuals   Yes   Usually, but don't need to   Individual
Borderline Groups    Combination   Yes   Individuals   Yes   Usually                      Individual


When the tests are used for yearly promotions, students' performance in
the next grade can be used as a criterion in order to estimate the
probabilities of classification errors.

Research is needed on ways of pooling the judgments of several
individuals, and of incorporating performance data in primarily content-
based judgments.

6.9.4    Professional Licensing/Certification Testing

Tests for licensing and certification differ from the others
discussed here in having an external criterion, job performance, which the
tests should predict. In addition, these tests are subject to
governmental regulations and court rulings on the adequacy with which they
reflect requisite job skills (and nothing more). Recent court decisions
affirm that content validation of a test against the domain of entry-
level job skills is sufficient to demonstrate that the test itself is
fair. However, any standard used must also bear a rational relationship
to job performance.

One method that will probably be acceptable in the courts is to base
the standard on experts' judgments of the importance of each tested
item to adequate job performance; that is, to use one of the content-
oriented methods to determine a percent correct for passing. The pooled
judgments of a large number of expert practitioners would be desirable.

Data on test performance would not be particularly useful in this
situation since there is usually not any pre-existing knowledge or
belief about the distribution of job-preparedness in the population.
Empirical data on criterion (job) performance would be useful were it
not for the pervasive selectivity of professions; to use criterion
performance properly in establishing optimal passing scores requires an
unselected population of job-holders. For these reasons, content-
oriented procedures for setting standards are probably the most viable
procedures in licensing and certification.




6.10  Summary

In this unit, a number of viable methods for setting standards
were introduced. If you wish to view the test by itself and not in
relationship to other variables, either Angoff's method or Nedelsky's
method appears to be useful. If empirical data are available, Berk's
method or the Contrasting-Groups Method seems especially useful. We
have also discussed other methods, of a more complex nature, that are
suitable for setting criterion-referenced standards. Our preference
for the methods mentioned above stems from the fact that they are
simple to implement, and appear to produce defensible results when
applied correctly. In the final section of the paper, some proposed
sets of procedures for standard setting with respect to three important
uses of criterion-referenced tests were outlined. However, considerably
more research must be done before these procedures can be recommended
for wide-scale use.

We will conclude this unit with a brief discussion of a very important
problem. Suppose a set of test items has been selected. If
so, it is then possible to set standards via either judgmental or
empirical methods (or both). However, if a standard can be set via
reference to well-defined domain specifications and sample test items,
tests which will optimally discriminate (i.e., reduce the number of
misclassifications) in the region of a standard can be constructed. This
is done by selecting test items which "discriminate" in the region of
the standard. Test items are piloted on samples of examinees similar to
those who will eventually be administered the tests to determine item
difficulty levels and discrimination indices. Items with p values near
the standard and with the highest discrimination indices are selected
for the test. Whether judges can reliably set standards from only domain
specifications and some sample test items is unknown. Also, it is not
known if standards set by these two different methods will produce
different results. This is one of those situations where similar
results across two methods would be highly desirable.
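
The item selection step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Item` record, the `select_items` function, the width of the window around the standard (0.15), and the pilot values are all hypothetical and do not come from this unit. The discrimination index is assumed to be whatever index was computed from the pilot sample (for example, a point-biserial correlation).

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    p_value: float         # proportion of pilot examinees answering correctly
    discrimination: float  # discrimination index estimated from the pilot


def select_items(items, standard, n_items, window=0.15):
    """Keep items whose p value lies within `window` of the standard,
    then return the `n_items` items with the highest discrimination."""
    near = [it for it in items if abs(it.p_value - standard) <= window]
    near.sort(key=lambda it: it.discrimination, reverse=True)
    return near[:n_items]


# Hypothetical pilot results (not data from the report).
pilot = [
    Item("i1", 0.82, 0.45),
    Item("i2", 0.78, 0.30),
    Item("i3", 0.40, 0.60),  # highly discriminating, but far from the standard
    Item("i4", 0.85, 0.52),
    Item("i5", 0.97, 0.20),
    Item("i6", 0.75, 0.38),
]

chosen = select_items(pilot, standard=0.80, n_items=3)
print([it.item_id for it in chosen])
```

Note that item i3 is excluded even though it has the highest discrimination index, because its difficulty is far from the standard. How wide the window should be is a judgment call that the text does not settle.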


6.11  References 




Andrew, B. J., & Hecht, J. T. A preliminary investigation of two procedures
for setting examination standards. Educational and
Psychological Measurement, 1976, 36, 35-50.

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike
(Ed.), Educational measurement. Washington, D.C.: American Council
on Education, 1971.

Berk, R. A. Determination of optimal cutting scores in criterion-referenced
measurement. Journal of Experimental Education, 1976,

Block, J. H. Student learning and the setting of mastery performance
standards. Educational Horizons, 1972, 50, 183-190.

Burton, N. Societal standards. Journal of Educational Measurement, 1978,
15, 263-271.

Conaway, L. E. Discussant comments: Setting performance standards based
on limited research. Florida Journal of Educational Research, 1976,
18, 35-36.

Conaway, L. E. Setting standards in competency-based education: Some
current practices and concerns. Paper presented at the annual
meeting of NCME, New York, 1977.

Ebel, R. L. Essentials of educational measurement. Englewood Cliffs,
NJ: Prentice-Hall, 1972.

Ebel, R. L. The case for minimum competency testing. Phi Delta Kappan,
April, 1978, 546-549.

Educational Testing Service. Report on a study of the use of the National
Teachers Examination by the State of South Carolina. Princeton, NJ:
Educational Testing Service, 1976.

Emrick, J. A. An evaluation model for mastery testing. Journal of
Educational Measurement, 1971, 8, 321-326.

Glass, G. V. Standards and criteria. Journal of Educational Measurement,
1978, 15, 237-261. (a)

Glass, G. V. Minimum competence and incompetence in Florida. Phi Delta
Kappan, 1978, 59, No. 9 (May), 602-605. (b)

Hambleton, R. K. On the use of cut-off scores with criterion-referenced
tests in instructional settings. Journal of Educational Measurement,
1978, 15, 277-290.




Hambleton, R. K., & Eignor, D. R. Competency test development, validation,
and standard-setting. In R. Jaeger & C. Tittle (Eds.),
Minimum competency testing. (Approx. Title) Berkeley, CA:
McCutchan Publishing Co., 1979.

Hambleton, R. K., & Novick, M. R. Toward an integration of theory and
method for criterion-referenced tests. Journal of Educational
Measurement, 1973, 10, 159-170.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B.
Criterion-referenced testing and measurement: A review of
technical issues and developments. Review of Educational
Research, 1978, 48, 1-47.

Huynh, H. Statistical consideration of mastery scores. Psychometrika,
1976, 41, 65-78.

Jaeger, R. M. Measurement consequences of selected standard-setting
models. Florida Journal of Educational Research, 1976, 18, 22-27.

Jaeger, R. M. A proposal for setting a standard on the North Carolina
High School Competency Test. Paper presented at the 1978 spring
meeting of the North Carolina Association for Research in Education,
Chapel Hill, 1978.

Klausmeier, H. J., Rossmiller, R. A., & Saily, M. Individually guided
elementary education. New York: Academic Press, 1977.

Kriewall, T. E. Aspects and applications of criterion-referenced tests.
Paper presented at the annual meeting of AERA, Chicago, 1972.

Livingston, S. A. A utility-based approach to the evaluation of pass/fail
testing decision procedures. Report No. COPA-75-01. Princeton,
NJ: Center for Occupational and Professional Assessment,
Educational Testing Service, 1975.

Livingston, S. A. Choosing minimum passing scores by stochastic approximation
techniques. Report No. COPA-76-02. Princeton, NJ: Center
for Occupational and Professional Assessment, Educational Testing
Service, 1976.

Macready, G. B., & Dayton, C. M. The use of probabilistic models in
the assessment of mastery. Journal of Educational Statistics,
1977, 2, 99-120.

Meskauskas, J. A. Evaluation models for criterion-referenced testing:
Views regarding mastery and standard-setting. Review of Educational
Research, 1976, 46, 133-158.

Millman, J. Passing scores and test lengths for domain-referenced measures.
Review of Educational Research, 1973, 43, 205-216.

Nassif, P. M. Standard-setting for criterion-referenced teacher licensing
tests. Paper presented at the annual meeting of NCME, Toronto,
1978.

Nedelsky, L. Absolute grading standards for objective tests. Educational
and Psychological Measurement, 1954, 14, 3-19.

Novick, M. R., & Lewis, C. Prescribing test length for criterion-referenced
measurements. In C. W. Harris, M. C. Alkin, & W. J.
Popham (Eds.), Problems in criterion-referenced measurement.
Monograph Series in Evaluation, No. 3. Los Angeles: Center for
the Study of Evaluation, University of California, 1974.

Novick, M. R., Lewis, C., & Jackson, P. H. The estimation of proportions
in m groups. Psychometrika, 1973, 38, 19-45.

Popham, W. J. Setting performance standards. Los Angeles: Instructional
Objectives Exchange, 1978.

Roudabush, G. E. Models for a beginning theory of criterion-referenced
tests. Paper presented at the annual meeting of NCME, Chicago,
1974.

Schoon, C. G., Gullion, C. M., & Ferrara, P. Credentialing examinations,
Bayesian statistics, and the determination of passing points.
Paper presented at the annual meeting of APA, Toronto, 1978.

Shepard, L. A. Setting standards and living with them. Florida Journal
of Educational Research, 1976, 18, 23-32.

Torshen, K. P. The mastery approach to competency-based education. New
York: Academic Press, 1977.

Van der Linden, W. J., & Mellenbergh, G. J. Optimal cutting scores using
a linear loss function. Applied Psychological Measurement, 1977,
1, 593-599.

Zieky, M. J., & Livingston, S. A. Manual for setting standards on the
Basic Skills Assessment Tests. Princeton, NJ: Educational Testing
Service, 1977.




Additional References

Block, J. H. Standards and criteria: A response. Journal of Educational
Measurement, 1978, 15, 291-295.

Brennan, R. L., & Lockwood, R. E. A comparison of two cutting score
procedures using generalizability theory. ACT Technical Bulletin
No. 33. Iowa City, Iowa: American College Testing Program, 1979.

Eignor, D. R. Psychometric and methodological contributions to criterion-referenced
testing technology. Unpublished doctoral dissertation,
University of Massachusetts, Amherst, 1979.

Emrick, J. A. An evaluation model for mastery testing. Journal of
Educational Measurement, 1971, 8, 321-326.

Levin, H. M. Educational performance standards: Image or substance?
Journal of Educational Measurement, 1978, 15, 309-319.

Linn, R. L. Demands, cautions, and suggestions for setting standards.
Journal of Educational Measurement, 1978, 15, 301-308.

Popham, W. J. As always, provocative. Journal of Educational Measurement,
1978, 15, 297-300.

Scriven, M. How to anchor standards. Journal of Educational Measurement,
1978, 15, 273-275.




Unit  7 


Criterion-Referenced Test and Test
Manual Evaluations¹


Ronald K. Hambleton
University of Massachusetts, Amherst

and

Daniel  R.  Eignor 
Educational  Testing  Service 


March  15,  1979 


¹Portions of this unit are from Hambleton, R. K., and Eignor,
D. R., Guidelines for evaluating criterion-referenced tests and test
manuals. Journal of Educational Measurement, 1978, 15, 321-327.




Table of Contents

7.0  Overview to the Unit

7.1  Introduction

7.2  A Proposed Set of Guidelines

7.3  Evaluation of Eleven Criterion-Referenced Tests

7.4  Concluding Remarks

7.5  A State System to Evaluate Criterion-Referenced Tests

7.6  References




7.0    Overview to the Unit

The scope and number of criterion-referenced tests available to
potential users are impressive. Unfortunately, the quality of these
tests varies tremendously, and so it is very important for potential
users to carefully review available tests before making their selections.

The primary purpose of this unit is to propose a set of guidelines
for evaluating criterion-referenced tests and test manuals.
The guidelines should be useful to both users and developers of
criterion-referenced tests. Secondary purposes are (1) to report on
our use of the guidelines with eleven commercially available criterion-referenced
test batteries, and (2) to briefly describe a State system
to evaluate criterion-referenced tests.



7.1    Introduction

Most of the major test publishers have published in the last few
years a wide assortment of criterion-referenced tests. In addition,
many school districts, state agencies, small testing firms, and consulting
firms have produced their own criterion-referenced tests.
Criterion-referenced tests are designed to address many problem
areas. For example, criterion-referenced tests are being used to
monitor student progress through school programs, to diagnose learning
disabilities, to report student progress to parents, to evaluate various
types of programs, and to certify or license professionals in many
fields. Unfortunately, it appears to us, and to many users of
criterion-referenced tests we have spoken with, that many of the available
tests fall short of the technical quality necessary for them to accomplish
their intended purposes. Perhaps one explanation is that many criterion-referenced
tests were developed before an adequate testing technology was
fully explicated. Fortunately, there now exists an adequate technology
for constructing criterion-referenced tests and using criterion-referenced
test scores (Hambleton, Swaminathan, Algina, &
Coulson, 1978; Popham, 1978). Another possible explanation is that
there has been a shortage of guidelines for constructing and using
criterion-referenced tests. Certainly the well-known Test Standards for
evaluating tests and test manuals prepared by a joint committee of AERA/
APA/NCME is helpful, but it is not completely applicable to criterion-referenced
tests. Besides the incompleteness of the AERA/APA/NCME
Test Standards for evaluating criterion-referenced tests and test
manuals, what relevant information there is, is scattered through 75
pages or so of other materials appropriate for norm-referenced test
evaluations. Therefore, the Test Standards, in its present form, is
not very useful for individuals interested in evaluating criterion-referenced
tests.


In the next section of this unit, we will propose a set of guidelines
for evaluating criterion-referenced tests and test manuals. The
guidelines should be useful to both users and developers of criterion-referenced
tests. Test standards are not offered (an example of a
standard is, "test score reliability must exceed .80"), but we do offer a
set of questions for consideration by potential users and developers
of criterion-referenced tests. The only other efforts we are aware of
to develop guidelines for evaluating criterion-referenced tests and test
manuals are Popham (1978, Chapter 8), Swezey and Pearlstein (1975), and
Walker (1977). In this unit, we will also report on our use of the
guidelines with eleven commercially available criterion-referenced test
batteries.

One caution and one comment seem appropriate to introduce at this
point. The guidelines represent our own biases about what is important
technical information for users to have in making informed decisions about
the quality of criterion-referenced tests.



7.2    A  Proposed  Set  of  Guidelines 


The list of guidelines was generated by placing ourselves in the
role of potential purchasers of a criterion-referenced test, and asking
"What questions would we want to answer before making a decision to use
a criterion-referenced test in a particular situation?" Questions
were organized around ten broad categories. They are: Objectives,
Test Items, Administration, Test Layout, Reliability, Cut-off Scores,
Validity, Norms, Reporting of Test Score Information, and Test Score
Interpretations. The questions are as follows:

A.  Objectives

A.1    Is the purpose (or purposes) of the test stated in a clear
and concise fashion?

A.2    Is each objective clearly written so that it is possible
to identify an "item pool"?

A.3    Is it clear from the list of objectives what the test
measures?

A.4    Is an appropriate rationale offered for including each
objective in the test?

A.5    Can a potential user "tailor" the test to meet local
needs by determining which objectives from a pool of objectives
offered by the publisher are to be measured by the test?

A.6    Is there a match between the content measured by the test
and the situation where the test is to be used?

A.7    Are individuals identified who were responsible for the
preparation of objectives?

A.8    Does the set of objectives measured by the test serve as a
representative set from some content domain of interest?




B.  Test Items

B.1    Is the item review process described?

B.2    Are the test items valid indicators of the objectives
they were developed to measure?

B.3    Is the set of test items measuring an objective representative
of the "pool" of items measuring the objective?

B.4    Are the items free of technical flaws?

B.5    Are the test items in an appropriate format to measure
the objectives they were developed to measure?

B.6    Are the test items free of bias (for example, sex, ethnic,
or racial)?

B.7    Was a heterogeneous sample of examinees employed in
piloting the test items?

B.8    Was the item analysis data used only to detect "flawed"
items?


C.  Administration

C.1    Do the test directions include information relative to
test purpose, time limits, practice questions, answer
sheets, and scoring?

C.2    Are the test directions clear?

C.3    Is the test easy to score?

C.4    Does the test manual specify an examiner's role and
responsibilities?

D.    Test  Layout 

D.1    Is the layout of the test booklets attractive?

D.2    Is  the  layout  of  the  test  booklets  convenient  for  examinees? 




E.  Reliability 


E.1    Is the type of reliability information offered in the test
manual appropriate for the intended use (or uses) of the
scores?

E.2    Was the sample (or samples) of examinees used in the
reliability study adequate in size, and representative
of the population for whom the test is intended?

E.3    Are test lengths suitable to produce tests with desirable
levels of test score reliability?

E.4    Is reliability information offered in the test manual
for each intended use (or uses) of the test scores?

F.  Cut-Off  Scores 

F.1    Was a rationale offered for the selection of a method for
determining cut-off scores?

F.2    Was the procedure for implementing the method explained,
and was it appropriate?

F.3    Was evidence for the validity of the chosen cut-off score
(or cut-off scores) offered?


G.  Validity 

G.1    Does the validity evidence offered in the test manual
address  adequately  the  intended  use  (or  uses)  of 
scores  obtained  from  the  test? 

G.2    Is  an  appropriate  discussion  of  factors  affecting  the 
validity  of  test  scores  offered  in  the  test  manual? 


H.  Norms 

H.1    Are the norms data reported in an appropriate form?

H.2    Are the samples of examinees utilized in the norming study described?

H.3    Are appropriate cautions introduced for proper test
score interpretations?



I.    Reporting of Test Score Information

I.1    Are the test scores reported for examinees on an objective
by objective basis?

I.2    Are there multiple options available to the user for
reporting of test results (for example, by class and
grade within a school)?

I.3    Are convenient procedures available for scoring tests by
hand, and forms available for reporting test score information?


J.    Test  Score  Interpretations 

J.1    Are suitable cautions included in the manual for interpreting
individual and group objective score information?

J.2    Are appropriate guidelines offered in the manual for
utilizing test scores to make descriptive statements,
instructional decisions, program evaluation decisions,
or other stated uses of the test scores?

A review form, keyed to the 39 guidelines offered above, is presented
on the next four pages.

The necessity for many of the guidelines is obvious. For others,
brief rationale statements are offered below:

A.4.    Rationale statements for the inclusion of particular objectives
in a test are especially important in competency-based certification.
For example, a manual we saw recently reported that
the test "was designed to measure the skills in reading and
mathematics necessary for effective participation in today's
complex society." Potential users of the test ought to know
the process by which skills or objectives measured by the test
were selected or identified.

A.5.    Many users desire to have flexibility in the objectives
included in their tests.

A.6.    Essentially the problem is one of determining content validity.
If there is some flexibility in objective selection, it is
easier to obtain content valid tests for specific uses.



Criterion-Referenced Test and Test
Manual Evaluation Form

3/15/79

Background Information

Test Name:

Test Publisher:

Year of Publication:

Forms and Levels:

Author(s):

Cost:

Reusable Booklets:   Yes   No        Time Limits:

Special Test Administration Conditions:

Manual and Other Technical Aids:


For each of the questions below there are
four possible answers: "Acceptable",
"Unacceptable", "Unsure", and "Not
Applicable". Place a "/" in the column
corresponding to your answer to each
question.

Question                    Ratings                    Comments

A.1.  Is the purpose (or purposes) of
the test stated in a clear and concise
fashion?

A.2.  Is each objective clearly written
so that it is possible to identify
an "item pool"?

A.3.  Is it clear from the list of objectives
what the test measures?

A.4.  Is an appropriate rationale
offered for including each objective
in the test?

A.5.  Can a user "tailor" the test to
meet local needs by selecting objectives
from a pool of available objectives?

A.6.  Is there a match between the
content measured by the test and
the situation where the test is to
be used?


A.7.  Are individuals identified who
were responsible for the preparation
of objectives?

A.8.  Does the set of objectives measured
by the test serve as a representative
set from some content
domain of interest?

B.1.  Is the item review process
described?

B.2.  Are the test items valid indicators
of the objectives they were
developed to measure?

B.3.  Is the set of test items measuring
an objective representative of the
"pool" of items measuring the
objective?

B.4.  Are the items free of technical
flaws?

B.5.  Are the test items in an appropriate
format to measure the objectives
they were developed to measure?

B.6.  Are the test items free of bias
(for example, sex, ethnic, or racial)?

B.7.  Was a heterogeneous sample of
examinees employed in piloting the
test items?

B.8.  Was the item analysis data used
only to detect "flawed" items?

C.1.  Do the test directions include information
relative to test purpose,
time limits, practice questions, answer
sheets, and scoring?


C.2.  Are the test directions clear?

C.3.  Is the test easy to score?

C.4.  Does the test manual specify an
examiner's role and responsibilities?

D.1.  Is the layout of the test booklets
attractive?

D.2.  Is the layout of the test booklets
convenient for examinees?

E.1.  Is the type of reliability information
offered in the test manual
appropriate for the intended use (or
uses) of the scores?

E.2.  Was the sample of examinees adequate
in size, and representative of
the population for whom the test is
intended?

E.3.  Are test lengths suitable to produce
tests with desirable levels of
test score reliability?

E.4.  Is reliability information offered
in the test manual for each intended
use (or uses) of the test scores?

F.1.  Was a rationale offered for the
selection of a method for determining
cut-off scores?

F.2.  Was the procedure for implementing
the method explained, and was it appropriate?


F.3.  Was evidence for the validity of
the chosen cut-off score (or cut-off
scores) offered?

G.1.  Does the validity evidence offered
in the test manual address adequately
the intended use (or uses) of scores
obtained from the test?

G.2.  Is an appropriate discussion of
factors affecting the validity of
test scores offered in the test
manual?

H.1.  Are the norms data reported in an
appropriate form?

H.2.  Are the samples of examinees
utilized in the norming study
described?

H.3.  Are appropriate cautions introduced
for proper test score interpretations?

I.1.  Are the test scores reported for
examinees on an objective by objective
basis?

I.2.  Are there multiple options available
to the user for reporting of
test results (for example, by class
and grade within a school)?

I.3.  Are convenient procedures available
for scoring tests by hand, and
forms available for reporting test
score information?

J.1.  Are suitable cautions included in
the manual for interpreting individual
and group objective score information?

J.2.  Are appropriate guidelines offered
for utilizing test scores to accomplish
stated purposes?



A.7.    Users ought to know the qualifications and experiences of
individuals involved in determining the objectives measured
by a test and the process they used in their objectives
selection work.

A.8.    There appears to be a tendency for some publishers to "slant" their
test coverage to objectives easiest to measure. Does the
set of objectives measured by the test provide adequate
coverage of an area of interest? This is an important
question for users to answer.

B.1.    Rigorous steps are necessary here. Popham (1978), for
example, provides some excellent guidelines that involve
many item raters matching items to the objectives the test
items were written to measure.

B.2.    This can be determined through the use of any one of several
rating forms. Face validity evidence is not sufficient.

B.3.    The best evidence here is provided by Cronbach's duplication
experiment. Alternately, judges can be asked the question
directly.

B.4.    Standard item writing principles should be used to assess
item quality.

E.1.    Even when reliability data are reported in a criterion-referenced
test manual, they are seldom appropriate for the
intended use of the test scores. Standard correlational
approaches to reliability provide little relevant information.
What is needed, if instructional decisions are
to be made, is some indication of the consistency of
decision-making over parallel forms or a retest administration.
When the test scores are intended to serve as
domain score estimates, some indication of the precision
of the estimates should be offered.

E.3.    Most users of criterion-referenced tests seem to be unaware
of the "large errors" existing in domain score estimates
and mastery assignments with short (1 to 5 item) tests.

E.4.    Criterion-referenced test scores are used in many ways.
Reliability evidence for one use (or in one sample) should
not be assumed for other uses (or in other samples).

F.1.    There are many methods for setting cut-off scores. A
rationale should be offered for any one that is selected.
The method should be consistent with the definitions of
mastery states offered for sorting examinees.



F.2.    Currently there is much debate about setting cut-off scores.
[illegible] The details of the method for determining the
cut-off score should be clearly specified.

F.3.    [illegible] outcome measure).

G.1.    There are many uses of criterion-referenced test scores.
If they are being used for descriptive purposes, evidence
of both content and construct validity should be [illegible].
If the test scores are used to [illegible] mastery
states, the relationship between [illegible] based on
the test scores and some appropriately selected independent
measure should be reported.

G.2.    Again, the problem is no different from that encountered with
norm-referenced tests. Since examinees are not being compared
with one another, there is a tendency among some publishers to
minimize the importance of standardized conditions.
On the other hand, it is becoming more [illegible] to prepare norms
tables for criterion-referenced tests. Standardized
test directions in these situations will be important.

H.3.    The problem of norms with criterion-referenced tests is
about the same as with norm-referenced tests. There is one
difference: criterion-referenced test scores tend to be
less reliable because [illegible]. [illegible] caution should
[illegible] individual [illegible] a problem.

I.2.    Users often desire to have their test results reported in a
variety of ways (for example, by class, [illegible], district,
sex, race). Are these and other options available?

I.3.    When users intend to score their own tests, it is [illegible]
to determine the [illegible]. Can scoring be done conveniently?
Are reporting forms available to simplify the process?

J.1.    Manuals need to stress the amount of error that [illegible]
criterion-referenced test [illegible]. [illegible] likelihood
of a user making false-positive and false-negative
errors? From our [illegible] seldom seen a criterion-referenced
[illegible] test score users [illegible] estimation or mastery state
determination.



7.3    Evaluation  of  Eleven  Criterion-Referenced  Tests 

Eleven of the more popular criterion-referenced tests were selected
for review. The names of the tests and some descriptive information are
presented in Figure 7.3.1.

Our  primary  purpose  was  to  ascertain  the  extent  to  which  these  tests 
met  our  guidelines.    We  have  reported  our  evaluation  of  each  test  relative 
to  each  guideline,  but  the  more  important  information  is  arrived  at  by 
determining how well the tests as a group meet each of our guidelines.
The group information is informative because it helps to pinpoint areas
where commercial materials are in need of revisions and further development.




Figure 7.3.1. Criterion-referenced tests selected for review.

Code  Name of Test                                             Grades    Publication  Publisher
                                                                         Date
 1    1976 Stanford Diagnostic Mathematics Test                1-12      1976         Harcourt Brace Jovanovich
 2    1976 Stanford Diagnostic Reading Test                    1-12      1976         Harcourt Brace Jovanovich
 3    Skills Monitoring System-Reading                         3-5       1975         Harcourt Brace Jovanovich
 4    Individual Pupil Monitoring System-Mathematics           1-6       1974         Houghton-Mifflin
 5    Individual Pupil Monitoring System-Reading               1-8       1974         Houghton-Mifflin
 6    Diagnostic Mathematics Inventory                         1.5-7.5   1977         CTB/McGraw-Hill
 7    Prescriptive Reading Inventory                           K-6.5     1977         CTB/McGraw-Hill
 8    Diagnosis: An Instructional Aid-Mathematics and Reading  1-6       1974         Science Research Associates
 9    Mastery: An Evaluation Tool-SOBAR Reading                K-9       1975         Science Research Associates
10    Mastery: An Evaluation Tool-Mathematics                  K-8       1974         Science Research Associates
11    Fountain Valley Support System in Mathematics            K-8       1974         Richard L. Zweig Associates

The Levels and Forms columns are only partly legible in the source. The
surviving entries, in order, are Levels: 4, 4, 3, 8, 10 and Forms: 2, 2, 1.




In judging the quality of a test and test manual relative to each
guideline, the following rating scale was used:


A  = Acceptable

A- = Acceptable, with reservations

X  = Unacceptable, data offered was unsuitable or
     improperly used

Y  = Unacceptable, no data was offered

N  = Not Applicable


Table 7.3.1 summarizes our ratings of the 11 tests on the 39 guidelines.


Our  most  significant  impressions  of  the  test  and  test  manuals  reviewed 
are  as  follows: 

1.  In areas such as Administration, Test Layout, and Norms, there
are few problems.

2.  Current commercially available "criterion-referenced tests"
reviewed in this paper should be called "objectives-referenced
tests" since the tests appear to be developed from behavioral
objectives (Popham, 1978). Starting to develop a test from a
listing of behavioral objectives is less than ideal because
behavioral objectives usually do not lead to unambiguous
definitions of the "item pools" keyed to the behavioral
objectives. The solution is to write "domain specifications"
(Popham, 1978).

3.  Only about half of the publishers included information about
the qualifications of individuals who prepared the objectives
measured by their test. The qualifications of participants
in this aspect of the test development process are important
information for potential users.


Table 7.3.1

Summary of Ratings of the Criterion-Referenced Tests

                                   Test
Guideline    1    2    3    4    5    6    7    8    9    10   11

A1-A7        (entries not legible in the source)
A8           A-   A-   A-   A-   A-   A-   A-   A-   A-   A-   A-
B1-B3        (entries not legible in the source)
B4           A    A    A    A    A    A    A    A    A    A    A
B5           A    A    A    A    A    A    A    A    A    A    A
B6           (entries not legible in the source)
B7           A    A    A    A    A    A    A    Y    Y    Y    Y
B8           X    X    A    X    X    X    A-   Y    X    X    Y
C1-C2        (entries not legible in the source)
C3           A    A    A    A    A    ?    A    A    A    A    A
C4           A    A    A    A    A    ?    A    A    A    A    A
D1           (entries not legible in the source)
D2           A    A    A    A    A    ?    A    A    A    A    A
E1           (entries not legible in the source)
E2           A    A    A    Y    Y    A    A    Y    A    A    Y
E3           A-   A-   A-   A-   A-   X    X    X    X    X    A-
E4           (entries not legible in the source)
F1           A    A    A    Y    A-   Y    A    X    A    A    Y
F2           A    A    X    Y    Y    X    X    Y    A    A    Y
F3           (entries not legible in the source)
G1-G2        (entries not legible in the source)
H1           A    A    A    N    N    A-   A    N    N    N    N
H2           A    A    N    N    N    ..   Y    N    N    N    N
H3           A    A    N    N    N    Y    Y    N    N    N    N
I1           A    A    A    A    A    ?    A    A    A    A    A
I2-I3        (entries not legible in the source)
J1           A-   A-   A    Y    Y    ..   A-   Y    A-   A-   Y
J2           A    A    A    X    X    ..   A    A-   A-   A-   A

".." marks a single entry that is not legible in the source.

1 We did not have the proper materials to assess the quality of the test
in the areas marked by a "?".
2 The information was on a cassette. We did not listen to the tape and so
we were not in a position to rate this aspect of the test.




4.  Since test developers have not used "domain specifications,"
it is impossible to assess "item representativeness." Item
representativeness is essential if users desire to use
objective scores to "generalize to the domains of behaviors
defined by the objectives." If item representativeness is
not established, scores can only be interpreted in terms of
the specific items included in the test.

5.  "Item analysis" is an area in which there are two problems:
(a) too little explanation is offered of the choice of
particular item statistics and of the specifics of item statistics
usage, and (b) item statistics are used in test construction,
thereby "biasing" the content validity of the test in unknown
ways.

6.  Test score reliability was not handled very well in most of
the manuals. Either (a) inappropriate information relative
to the stated uses of the test scores was offered, or (b) no
information was offered.

7.  Cut-off scores are typically offered, but there is no rationale
offered for setting cut-off scores. Procedures used for setting
cut-off scores are not explained, nor is any evidence offered
for the "validity" of cut-off scores (for example, do those
examinees classified as "masters" typically perform better than
"non-masters" on some appropriately chosen external criterion
measure?).

8.  Factors affecting the validity of scores are not offered in
any of the manuals.

9.  Only a few of the manuals introduced the notion of "error"
in test scores. It is extremely important for users to have
some indication of the "stability" of their objective scores
and/or "consistency of mastery/non-mastery decisions."




7.4    Concluding  Remarks 

Our proposed guidelines were developed after careful study of the
criterion-referenced testing literature and the Test Standards. However,
they are offered here only to serve as a "catalyst" for further
discussion and debate on a topic of considerable importance to the test and
measurement field. Our use of the proposed guidelines to evaluate eleven
criterion-referenced tests was intended to (1) demonstrate that the proposed
guidelines were workable, and (2) highlight areas where considerably more
(or different) work on the part of test developers is needed.



7.5    A State System to Evaluate Criterion-Referenced Tests

George Madaus and Peter Airasian from Boston College and the senior
author of the Practitioner's Guidebook prepared a test evaluation system
for the Commonwealth of Massachusetts through which content and
measurement specialists can determine the appropriateness of commercially
available criterion-referenced tests for meeting the Commonwealth's Basic
Skills Improvement Policy (BSIP). The BSIP is the Commonwealth's version
of a state-wide minimum competency testing program. The program covers
the areas of reading and mathematics. School districts must participate
in the program in one of three ways: (1) use the Commonwealth's tests,
(2) construct and use their own tests, or (3) use one of the commercially
available criterion-referenced tests which meet the Commonwealth's content
and technical criteria. Our efforts were directed toward the third use.
We developed a test evaluation system which includes rating forms,
directions for content and technical evaluations, checklists, and summary
evaluation sheets. The evaluation system is being used within the
Commonwealth to determine which tests meet the Commonwealth's content
and technical criteria and therefore can be chosen by school districts
for use in complying with the state's minimum competency testing law.

On  the  next  few  pages  are  several  documents: 

1.  Standardized  achievement  test  review  form 

2.  Directions  for  test  reviewers  —  content  review 

3.  Directions  for  test  reviewers  —  technical  review 

4.  Mathematics skills checklist

5.    Standardized  achievement  test  evaluation  summary  sheet. 




The materials are tailored to the BSIP. However, they should help
others who have the task of developing test evaluation systems for
other states. A report in preparation by Madaus, Airasian and Hambleton
will describe the BSIP, steps in the development and validation of the
test evaluation system, and several examples of its use.




Basic Skills Improvement Policy
Massachusetts Department of
Education


Standardized Achievement Test
- Review Form¹ -


1.   Reviewer ____________________     2.   Date of Review ____________

3.   Test Name ____________________

4.   Test Publisher ____________________

5.   Publication Date ____________

6.   Levels (Circle Grade Levels Covered by the Test):

     K   1   2   3   4   5   6   7   8   9   10   11   12

7.   Which form of the test is being reviewed? ____________

8.   Is the test being reviewed for Reading Skills or Math Skills? (Circle one)

     Reading     Math


If you are doing a content review, begin with
Question 9.

If you are doing a technical review, begin with
Question 13.


CONTENT CONSIDERATIONS

9.   How many of the fourteen reading skills or thirty-eight
     mathematics skills of the Massachusetts Basic Skills are measured
     by at least one item on the test?

     No. of Skills ______
     % of Skills   ______

10.  Overall, is the reading level of the items reviewed suitable
     for most of the students in the lowest grade covered by
     this test? (Cf. Question 6 above).                            YES   NO


¹This review form was prepared by Ron Hambleton, George Madaus and Peter
Airasian to meet specifications required by the Commonwealth of Massachusetts for use
in conjunction with the Massachusetts Basic Skills Improvement Policy.




11.  Overall, are the test items free of offensive sexual, cultural,
     racial, and/or ethnic content and/or stereotyping?            YES   NO

12.  If you answered "NO" to Question 11, please explain the reasons for your answer,
     including the type(s) of bias and the item number of any items of concern.


This is the end of the Content Review


TECHNICAL CONSIDERATIONS

13.  How many alternate forms of this test are available?

     No. of forms ______

14.  Is there a Technical Manual which includes information about the
     test regarding the following ten topics:

     a.  Item Review Methods                                       YES   NO

     b.  Item Analysis                                             YES   NO

     c.  Average Item Difficulty                                   YES   NO

     d.  Internal Consistency Reliability                          YES   NO

     e.  Test/Retest Reliability                                   YES   NO

     f.  Parallel Form Reliability                                 YES   NO

     g.  Standard Error of Measurement                             YES   NO

     h.  Content Validity                                          YES   NO

     i.  Norms                                                     YES   NO

     j.  Procedures for screening items for offensive sexual,
         cultural, racial, and/or ethnic content, and/or
         stereotyping                                              YES   NO


15.  How many of the items reviewed meet the standard rules
     of item writing?

     No. of items reviewed    ______
     No. of acceptable items  ______
     % of acceptable items    ______

16.  Were item analysis results used to identify "defective"
     test items?                                                   YES   NO   INA*

17.  Are data bearing on the consistency of mastery decisions
     (for one or more performance standards or cut-off scores)
     reported in the Technical Manual?                             YES   NO

18.  Is the consistency of mastery decisions (for one or more cut-off
     scores) reported in the Technical Manual equal to or above 90%?
                                                                   YES   NO   INA

19.  Do standard indices of internal consistency reliability
     reported on the total reading score or total mathematics
     score reach or exceed .90?                                    YES   NO   INA

20.  Do standard indices of test-retest or parallel form reliability
     as reported on the total reading score or total mathematics
     score reach or exceed .90?                                    YES   NO   INA

21.  If parallel-forms of the Test are available, do both forms
     (or multiple-forms, if available) measure equally well the
     content spanned by the skills included in the Test? (In
     other words, do the multiple-forms of the Test have
     equivalent content validity?)                                 YES   NO   INA

22.  Are the test score norms based on data that is no more
     than five years old?                                          YES   NO   INA

23.  Were the norm groups of sufficient size (i.e., at least
     300 students)?                                                YES   NO   INA

24.  Were the samples of students used in the norming study
     representative of students in the grades for which this
     test is intended? (Cf. Question 6)                            YES   NO   INA

25.  Were the samples of students used in the norming study
     representative of important strata within the society
     (i.e., rural pupils, minority group pupils, pupils in
     large city schools, etc.)?                                    YES   NO   INA


*INA - Information not available


26.  Are the test administration directions suitable for students
     in the lowest grade covered by the test? (Cf. Question 6)     YES   NO

     If "NO", please explain ____________________

27.  Do the test administration directions address the matter
     of time limits?                                               YES   NO

     If "NO", please explain ____________________

28.  Do the test administration directions indicate to the student how
     to handle the problem of guessing?                            YES   NO

     If "NO", please explain ____________________

29.  Is the layout or format of the test booklet convenient for
     students in the lowest grade covered by the test? (Cf. Question 6)
                                                                   YES   NO

     If "NO", please explain ____________________

30.  Is the layout or format of the answer sheet convenient for
     students in the lowest grade covered by the test?
     (Cf. Question 6)                                              YES   NO

     If "NO", please explain ____________________

31.  Does the test include practice questions?                     YES   NO


This is the end of the Technical Review


Basic Skills Improvement Policy
Massachusetts Department of
Education


Directions for Test Reviewers
- Content Review -

The  content  review  you  are  about  to  undertake  involves  three  principal  tasks: 

a.  Deciding whether each of the test items the publisher has nominated as
measuring each of the fourteen reading skills or thirty-eight mathematics
skills of the Massachusetts Basic Skills Policy is in fact an appropriate
indicator of the skill in question.

b.  Deciding whether overall the reading level of the items on the test is
suitable for the majority of students in the lowest grade covered by the
test.

c.  Deciding  whether  overall  the  test  is  free  of  offensive  sexual,  cultural, 
racial  or  ethnic  content  and/or  stereotyping. 

You  are  asked  to  make  a  determination  on  each  of  these  points  by  completing  the  enclosed 
Review  form.  Three  people  will  review  each  test  and  will  meet  to  arrive  at  a  composite 
rating  for  each  test.  A  separate  technical  review  of  each  test  is  also  being  carried  out. 

To  begin  the  review  you  should  have  the  following  materials  in  front  of  you: 


a.  A  copy  of  the  reading  or  math  tests  to  be  reviewed. 

b.  A  list  of  the  test  items  which  the  test  publisher  feels  correspond  to 
each  of  the  fourteen  reading  skills  or  thirty-eight  mathematics  skills  of 
the  Massachusetts  Basic  Skills  Policy. 

c.  A skills checklist which lists the fourteen reading skills (blue color)
or thirty-eight mathematics skills.

d.  A Standardized Achievement Test Review Form.

e.  A Standardized Achievement Test Evaluation Summary Sheet (pink
color).



Step A. -    Complete the "Basic Information" section of the Standardized Achievement
Test Review Form (Questions 1 - 8).

Fill out the background information section on the Skills Checklist and on
the Test Evaluation Summary Sheet.

Step B. -    Read carefully through the list of skills included in the Skills Checklist.

Read carefully through all the test items on the reading or mathematics
test under review.




Step C. -    Question 9 on the Review Form

For each skill listed on the Skills Checklist read each item which the
publisher has nominated as a measure of that skill. If you agree that the
item is a valid indicator of the skill in question, list the item number in the
space provided. Once you have finished with a skill, count up the number of
items nominated by the publisher which you feel are valid indicators of the
skill and place the total number in the blank space provided on the Skills
Checklist.

If at least one item nominated by the publisher is a valid indicator of the
skill in question you should place a "✓" beside the Commonwealth's skill
listed on the Skills Checklist in the box provided.

After you have completed your review of each of the nominated questions
for each of the fourteen reading skills or thirty-eight mathematics skills,
add up the total number of acceptable items across all the skills and place
your total in the space provided at the end of the checklist. Next, in the
space provided, write the total number of items on the reading or math
test reviewed.

Finally, count up the number of "✓" marks (i.e., each skill that has at
least one item you feel is a valid indicator of that skill). Place the total
number of "✓" marks in the space provided in Question 9 on the Review Form.
Calculate the percent of skills measured by at least one test item. For
example, suppose 8 of the Commonwealth's 14 reading skills are measured
by at least one item on a Test. You would write "57" in the space provided
beside Question 9 for percent of skills included in the test.
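
The percent calculation for Question 9 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and rounding to the nearest whole percent is an assumption inferred from the worked example above (8 of 14 skills is written as 57).

```python
def percent_of_skills_measured(skills_measured: int, total_skills: int) -> int:
    """Percent of Basic Skills covered by at least one acceptable item.

    skills_measured: number of check marks on the Skills Checklist.
    total_skills: 14 for reading or 38 for mathematics.
    Rounding to a whole percent is an assumption, not a stated rule.
    """
    if total_skills <= 0:
        raise ValueError("total_skills must be positive")
    if not 0 <= skills_measured <= total_skills:
        raise ValueError("skills_measured must lie between 0 and total_skills")
    return round(100 * skills_measured / total_skills)


if __name__ == "__main__":
    # The example from the directions: 8 of the 14 reading skills.
    print(percent_of_skills_measured(8, 14))
```

For the example in the directions the function returns 57, the value the reviewer would enter beside Question 9.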


Step D. -    Question 10 on the Review Form

This item is self-explanatory. Make your decision on the basis of
your reading of all the items on the test. For example, if the test is
designed for 7th, 8th, and 9th graders (indicated in Question 6) the
reading level should be appropriate for 7th graders.


Step E. -    Questions 11 and 12

Question 11 - After reading through all the items on the test, decide
whether overall the test is free of offensive sexual, cultural, racial,
and/or ethnic content and/or stereotyping. You should examine all test
items to determine whether there is a consistent or overriding pattern
of racial, ethnic, cultural, or sexual stereotyping and/or offensive
content. Your judgment should be made within the context of the total
test. The fact that one or two items portray a woman in the kitchen or
a minority group member in an unskilled occupation does not necessarily
imply stereotyping. Some women do spend time in the kitchen and some
minority group members do hold unskilled jobs. At issue is whether members
of such groups are consistently or predominantly portrayed in such
circumstances relative to the way in which other groups are portrayed.

Question 12 - Self-explanatory.

Step F. -    Transfer the information from the Review Form to the Test Evaluation
Summary Sheet.


Thank you for your time and effort.




Basic Skills Improvement Policy
Massachusetts Department of
Education


Directions for Test Reviewers
- Technical Review -


The technical review you are about to undertake involves making judgments
about certain technical characteristics of tests which are being considered for
possible inclusion on a State-approved list of standardized commercial tests.
Local school districts may use a test on the list to assess basic skills in reading
and mathematics at the secondary level (grades 1-12).

Three people will review each test and will meet to arrive at a composite
rating for each test. A separate content review of each test is also being carried
out to assess the test's content validity relative to the Massachusetts Basic Skills
Policy.


To begin the review you should have the following materials in front of
you:


a.  Copies of the test to be reviewed.

b.  Copies of the Technical Manual for each test.

c.  A Standardized Achievement Test Review Form.

d.  A Standardized Achievement Test Evaluation
Summary Sheet (pink color).


Step A - Complete the "Basic Information" section of the Standardized Achievement
Test Review Form, Questions 1-8.

Fill out the background information section on the Test Evaluation
Summary Sheet.


Step B - Read carefully through the test booklets and the Test's Technical Manual.


Step C - THE TECHNICAL REVIEW BEGINS AT QUESTION 13. Complete each
of the following questions on the Review Form:


Questions 13 and 14 - Self-explanatory.

Question 15 - Read the technical aid, "Multiple-Choice Item
Writing Principles" on page 32, and then randomly select and
review 25% of the test items to determine the percent of
these test items which do not violate any of the standard rules
of multiple-choice item writing. Write the number of items
reviewed, the number of acceptable items and the percent
of items reviewed which are acceptable in the spaces provided
beside Question 15 on the review form.
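
As an illustration of the sampling step in Question 15, the following Python sketch draws a random 25% of the item numbers and computes the percent judged acceptable. The function names, the use of a seed, and rounding the sample size up to a whole item are assumptions made for this example; the directions only say to select 25% at random.

```python
import math
import random


def sample_items_for_review(n_items, fraction=0.25, seed=None):
    """Return a sorted random sample of item numbers (1-based) to review.

    Rounding the sample size up is an assumption; the directions do not
    say how to handle a test whose length is not a multiple of four.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    rng = random.Random(seed)  # a seed makes the draw repeatable
    k = max(1, math.ceil(fraction * n_items))
    return sorted(rng.sample(range(1, n_items + 1), k))


def percent_acceptable(n_reviewed, n_acceptable):
    """Percent of reviewed items that violate none of the item-writing rules."""
    if n_reviewed < 1 or not 0 <= n_acceptable <= n_reviewed:
        raise ValueError("counts are inconsistent")
    return round(100 * n_acceptable / n_reviewed)


if __name__ == "__main__":
    items = sample_items_for_review(40, seed=1979)
    print("Items to review:", items)
    # Suppose the reviewer judges 9 of the 10 sampled items acceptable.
    print("Percent acceptable:", percent_acceptable(len(items), 9))
```

The three values entered beside Question 15 would then be the sample size, the count of acceptable items, and the percent.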

Question 16 - Check to be sure that item difficulties and item
discrimination indices were used in any item analyses. (In
constructing criterion-referenced tests, however, the latter
is a more important and useful statistic.)

INA means Information Not Available.

Questions 17 and 18 - Check for the proportion of agreement
in decision-making across parallel-form or retest
administrations. Alternately, check to see if the statistic κ (kappa)
is reported. It reflects the proportion of agreement over and above
agreement which is due to chance alone.
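
To make the two decision-consistency statistics concrete, here is a minimal Python sketch. It assumes each examinee is classified as master or non-master on two administrations, and it takes the chance-corrected statistic to be Cohen's kappa, which matches the description above; the function name and the sample data are hypothetical.

```python
def agreement_and_kappa(first, second):
    """Return (proportion of agreement, kappa) for two sets of decisions.

    first, second: equal-length lists of booleans (True = master) for the
    same examinees on two parallel-form or retest administrations.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    # Observed proportion of consistent decisions.
    p_observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance from the marginal mastery rates.
    p1 = sum(first) / n
    p2 = sum(second) / n
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


if __name__ == "__main__":
    form_a = [True, True, True, False, False, True, False, True, True, False]
    form_b = [True, True, False, False, False, True, True, True, True, False]
    p, k = agreement_and_kappa(form_a, form_b)
    print(f"proportion of agreement = {p:.2f}, kappa = {k:.2f}")
```

In this made-up data set 8 of 10 decisions agree, so the proportion of agreement is .80, while kappa is lower because some of that agreement would occur by chance. Question 18 asks whether the reported consistency is at or above 90%.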

Questions 19 and 20 - The test manual will most likely report
numerous reliability indices. In general, do these indices
reach or exceed .90?

Question 21 - Check to see if the content validity of two (or more)
forms is the same. Often the Technical Manual will discuss
content emphases and summarize the relevant information in charts
or tables. If this information is not satisfactory the parallel
forms will be reviewed separately another time by another
review committee.


Questions 22 and 23 - Self-explanatory.

Questions 24 and 25 - Check to see if charts are produced to show
the representation of any norms groups. Do they look reasonable?

Questions 26 to 31 - These six questions are self-explanatory.


Step D - Transfer the information from the Review Form to the Test Evaluation
Summary Sheet.




Multiple-Choice  Item  Writing  Principles 


1.  Is  the  item  stem  clearly  written  for  the  intended  group  of  students? 

2.  Is  the  item  stem  free  of  irrelevant  material? 

3.  Is  a  single  problem  clearly  defined  in  the  item  stem? 

4.  Are the answer choices clearly written for the intended group of students?

5.  Are the answer choices free of irrelevant material?

6.  Is there one correct answer or a clearly best answer?

7.  Have words like "always," "none," or "all" been removed?

8.  Are likely student mistakes used to prepare incorrect answers?

9.  Is "all of the above" avoided as an answer choice?

10.  Are the answer choices arranged in a logical sequence (if one exists)?

11.  Was the correct answer randomly positioned among the available answer choices?

12.  Are all repetitious words removed from the answer choices
and included in the item stem?

13.  Are all of the answer choices of approximately the same length?

14.  Do  the  item  stem  and  answer  choices  follow  standard  rules  of  punctuation 
and  grammar? 

15.  Are  all  negatives  underlined? 

16.  Are  grammatical  cues  between  the  item  stem  and  the  answer  choices, 
which  might  give  the  correct  answer  away,  removed? 

17.  Are  letters  used  in  front  of  the  possible  answer  choices  to  identify  them? 

18.  Have  expressions  like  "which  of  the  following  is  not"  been  avoided? 




Basic Skills Improvement Policy
Massachusetts Department of
Education


- Mathematics Skills Checklist¹ -


Reviewer ____________________          Date of Review ____________

Test Name ____________________


Place a "✓" beside those skills which are measured by the test.


Mathematics Skills


a.    Number and Numeration Concepts


□  1.    Recognize number symbols (17, eighteen), whole numbers (34),
fractions (1/2), decimals (3.75), and powers of 10 (10²).

List the number of each item which you feel is a measure of this
skill.


Total number of items for this skill ______


□  2.    Identify odd and even numbers.

List the number of each item which you feel is a measure of this
skill.


Total number of items for this skill ______


□  3.    Put numbers in numerical order.

List the number of each item which you feel is a measure of this
skill.


Total number of items for this skill ______


¹Only the first page of the mathematics skills checklist is presented
here.




Basic Skills Improvement Policy
Massachusetts Department of
Education


Standardized Achievement Test
Evaluation Summary Sheet


Reviewer ____________________          Date of Review ____________

Test Name ____________________

Check one -   Reading ____    Math ____


Fill in your ratings, determine the points, and write in the score for each question in
the space provided.


CONTENT CONSIDERATIONS

Question   Rating    Point System              Score

 9         ____%     90-100% - 5 points        ____
                     80- 89% - 4 points
                     70- 79% - 3 points
                     60- 69% - 1 point
                       <60%  - 0 points

10         ____      Yes - 2 points            ____
                     No  - 0 points

11         ____      Yes - 3 points            ____
                     No  - 0 points

12         ____      No points

                     TOTAL CONTENT POINTS      □


TECHNICAL CONSIDERATIONS

Question   Rating    Point System              Score

13         ____      No points

14   a     ____      Yes - 1 point             ____
     b     ____      No  - 0 points            ____
     c     ____      for each item             ____
     d     ____      "a" through "j"           ____
     e     ____                                ____
     f     ____                                ____
     g     ____                                ____
     h     ____                                ____
     i     ____                                ____
     j     ____                                ____

Question   Rating    Point System              Score

15         ____%     90-100% - 5 points        ____
                     80- 89% - 4 points
                     70- 79% - 3 points
                       <70%  - 0 points

16         ____      Yes  - 3 points           ____
                     No   - 0 points
                     INA* - 0 points

17         ____      Yes - 1 point             ____
                     No  - 0 points

18         ____      Yes - 5 points            ____
                     No  - 0 points
                     INA - 0 points

19         ____      Yes - 5 points            ____
                     .80-.89 - 3 points
                     .70-.79 - 1 point
                     less than .70 - 0 points
                     INA - 0 points

20         ____      Yes - 5 points            ____
                     .80-.89 - 3 points
                     .70-.79 - 1 point
                     less than .70 - 0 points
                     INA - 0 points

21         ____      No points
                     However, if No or INA then
                     alternative forms of the
                     test are subject to a
                     separate review at another
                     time

22         ____      Yes - 2 points            ____
                     No  - 0 points
                     INA - 0 points

23         ____      Yes - 2 points            ____
                     No  - 0 points
                     INA - 0 points


*INA - "Information not available"


Question   Rating    Point System              Score

24         ____      Yes - 3 points            ____
                     No  - 0 points
                     INA - 0 points

25         ____      Yes - 3 points            ____
                     No  - 0 points
                     INA - 0 points

26         ____      Yes - 2 points            ____
                     No  - 0 points
                     INA - 0 points

27         ____      Yes - 2 points            ____
                     No  - 0 points

28         ____      Yes - 2 points            ____
                     No  - 0 points

29         ____      Yes - 2 points            ____
                     No  - 0 points

30         ____      Yes - 2 points            ____
                     No  - 0 points

31         ____      Yes - 2 points            ____
                     No  - 0 points

                     TOTAL TECHNICAL POINTS    □
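
The content portion of the point system can be sketched in Python as follows. This is a hedged illustration rather than part of the Commonwealth's materials: the names `band_points`, `Q9_BANDS`, `Q15_BANDS`, and `total_content_points` are hypothetical, and the bands are copied from the Point System columns for Questions 9, 10, 11, and 15.

```python
# (lower bound in percent, points), checked from the highest band down.
Q9_BANDS = [(90, 5), (80, 4), (70, 3), (60, 1)]   # below 60% earns 0 points
Q15_BANDS = [(90, 5), (80, 4), (70, 3)]           # below 70% earns 0 points


def band_points(percent, bands):
    """Return the points for the first band whose lower bound is met."""
    for lower_bound, points in bands:
        if percent >= lower_bound:
            return points
    return 0


def total_content_points(percent_skills, q10_yes, q11_yes):
    """Total Content Points: Question 9 bands plus Questions 10 and 11.

    Question 12 carries no points, so it does not appear here.
    """
    return (band_points(percent_skills, Q9_BANDS)
            + (2 if q10_yes else 0)
            + (3 if q11_yes else 0))


if __name__ == "__main__":
    # A test measuring 57% of the skills earns nothing on Question 9.
    print(total_content_points(57, q10_yes=True, q11_yes=True))
    # A test measuring 93% of the skills earns the full 5 points.
    print(total_content_points(93, q10_yes=True, q11_yes=True))
    # Question 15 uses the same idea with different bands.
    print(band_points(75, Q15_BANDS))
```

Keeping the bands in a small table, separate from the lookup, makes it easy to check them against the summary sheet or to adjust them for another state's criteria.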



7.6  References 

\ 

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B.
Criterion-referenced testing and measurement: A review of
technical issues and developments. Review of Educational
Research, 1978, 48, 1-47.


Popham, W. J. Criterion-referenced measurement. Englewood Cliffs,
NJ: Prentice-Hall, 1978.


Swezey, R. W., & Pearlstein, R. B. Guidebook for developing
criterion-referenced tests. A report prepared for the U.S. Army Research
Institute for the Behavioral and Social Sciences. Reston, VA:
Applied Science Associates, August, 1975.

Walker,  C.  B.  Standards  for  evaluating  criterion-referenced  tests. 
Los  Angeles:  Center  for  the  Study  of  Evaluation,  UCLA,  1977. 
(Unpublished manuscript.)




Unit 8

Using  and  Reporting  Test  Score  Information 


Prepared  By 

Ronald  K.  Hambleton 
University  of  Massachusetts,  Amherst 

and

Daniel  R.  Eignor 
Educational  Testing  Service 




March  15,  1979 




Table of Contents

8.0  Overview of the Unit
8.1  Introduction
8.2  Uses of Criterion-Referenced Test Scores
8.3  Domain Score Estimation
     8.3.1  Introduction
    *8.3.2  Specialized and Bayesian Estimates
8.4  Mastery State Determination
     8.4.1  Introduction
    *8.4.2  Advanced Decision Models
*8.5  Simulation Study Involving Criterion-Referenced Test Scores
8.6  Reporting of Information
     8.6.1  Individual Student
     8.6.2  Group
8.7  Grading
8.8  References


Note:  Starred ("*") sections may be omitted without loss of
continuity. These sections involve the use of Bayesian
statistical methods with criterion-referenced test scores.




8.0    Overview  of  the  Unit 

Procedures  for  using  and  reporting  criterion-referenced 
scores are discussed in this unit.



8.1  Introduction 

The first five units of the Practitioner's Guidebook covered
methods for developing and validating criterion-referenced tests.
In Unit 6, issues and methods for setting standards were introduced.
A list of guidelines for evaluating criterion-referenced tests and test
manuals was presented in Unit 7. Ways in which test scores obtained
through applications of criterion-referenced tests can be used and
reported will be considered in this unit. First, we discuss several
uses of criterion-referenced test scores. Second, we discuss
and provide examples of the ways in which criterion-referenced test
score information can be reported.

In sum, in this unit, we hope to give the reader some practical
information on ways to use and/or report criterion-referenced test
score information. In the next unit, we will extend the discussion
presented here to the application of criterion-referenced tests in
two popular instructional models.




8.2    Uses  of  Criterion-Referenced  Test  Scores 

Millman (1974) delineates four uses of criterion-referenced test
scores:

1.  estimation of domain scores and allocation to mastery states
    in instructional settings,

2.  evaluation of programs,

3.  needs assessment purposes,

4.  teaching improvement and personnel evaluation.

The focus of the material to be discussed in the remainder of Unit 8 will
be on the first use listed. Hambleton and Gifford have prepared a paper
that discusses the use of criterion-referenced tests in program
evaluation. (After final editing, this paper will be included in these
instructional materials.) Besides Millman (1974), Popham (1975) also
discusses the uses of criterion-referenced tests for program evaluation.
In reference to program evaluation, Millman (1974) states:

One consideration in the evaluation of instructional
programs is the degree to which the objectives of the
program have been met. DRT's [domain referenced
tests] are designed to present such information.
Further, in contrast to national, norm-referenced
tests, tests referencing the specific domains of
learner behaviors to which the instructional effort
is directed have a better chance of detecting
areas in which the program has been successful or is
in need of modification.

In reference to the use of criterion-referenced tests in assessing
needs, it is helpful first to discuss the meaning of need. According
to Millman (1974), "A need can be defined as the difference between
expected and actual status." In other words, a discrepancy exists
between present status and what is expected. However, before movement
can be initiated for change, the present status of the area to be changed
(for instance, an instructional program) must be determined. A
criterion-referenced test is most useful in establishing present status.

In reference to teaching improvement, Millman (1974) states:

When student performance is measured by DRT's [domain
referenced tests], the desired student behavior becomes
explicit. The precise boundaries of the behavior
to be assessed are defined, and criteria for
judging the adequacy of learner responses are identified.
Such information makes it possible for the
teacher to devise more relevant instructional materials
and provides for a fairer evaluation of the teacher's
performance.

In  the  sections  to  follow,  we  will  focus  attention  on  two  uses  of 
criterion-referenced  test  scores  in  instructional  settings.    These  uses 
are  (1)  estimation  of  examinee  domain  scores,  and    (2)  allocation  of 
examinees  to  mastery  states. 




8.3    Domain  Score  Estimation 
8.3.1  Introduction 

In this section, the basic problems involved with domain score
estimation are introduced. Then we discuss specialized procedures
for estimating domain scores (section 8.3.2). The discussion in 8.3.2
has been taken from a paper by Hambleton, Swaminathan, Algina and
Coulson (1978).

We assume that a test is constructed by randomly sampling items
from a well-defined, or clearly specified, domain of items measuring
an instructional objective. If the test measures more than a single
objective, the items must be randomly sampled from the domain of items
measuring each objective. (An examinee has a domain score defined
for each objective measured by the test.)

In problems of domain score estimation, it is common to use an
examinee's test score (or proportion-correct score) as an estimate
of the domain score of that examinee. An examinee's domain score is
his/her true level of performance in the domain of items measuring the
objective. Of course, there will be error involved in using the
observed test score as an estimate of the domain score, and that is
why specialized methods involving Bayesian procedures have been
developed. The estimates derived from these procedures are more
precise (i.e., contain less error) estimates of the domain score.

When using the test score, or some other derived (e.g., Bayesian)
estimate, to estimate an examinee's domain score, error can be defined
as the difference between the estimated and true value (i.e., domain
score). The test developer thus wants to choose as his/her estimate
the one that minimizes these differences over the group of individuals
tested.

As mentioned above, the simplest and most obvious estimate of an
examinee's domain score, which is denoted $\pi_i$ for the ith examinee, is
his/her observed proportion-correct score, denoted $\hat{\pi}_i$. This
estimate is obtained by dividing the examinee's test score, $x_i$ (the
number of items answered correctly), by the total number, n, of items
measuring the objective included in the criterion-referenced test.
Although the proportion-correct score is an unbiased estimate of domain
score, this estimate is highly unreliable when the number of items on
which the estimate is based is small. For this reason, specialized
procedures are used that take into account other available information in
order to produce more precise estimates, especially when there are only
a few items on the test measuring an objective. In section 8.3.2, we
discuss a number of such estimates.
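
To illustrate why a short test gives an unreliable estimate, the sketch below computes the standard error of the proportion-correct score for several test lengths. It assumes that items are randomly sampled and scored independently, so that the number correct is binomial; the function name and the example values are ours and serve only as an illustration.

```python
import math


def proportion_correct_se(pi: float, n: int) -> float:
    """Binomial standard error of x/n for an examinee with domain score pi.

    Assumes n randomly sampled, independently scored items (our assumption).
    """
    if not 0.0 <= pi <= 1.0 or n <= 0:
        raise ValueError("pi must lie in [0, 1] and n must be positive")
    return math.sqrt(pi * (1.0 - pi) / n)


if __name__ == "__main__":
    # The standard error shrinks only with the square root of test length.
    for n_items in (5, 10, 20, 40):
        print(n_items, round(proportion_correct_se(0.7, n_items), 3))
```

Under this assumption, quadrupling the number of items is needed to halve the standard error, which is why objectives measured by only a few items call for the procedures of section 8.3.2.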

*8.3.2    Specialized  or  Bayesian  Estimates  of  Domain  Score 
The estimates discussed in this section utilize additional
information besides an individual's proportion-correct score to
arrive at examinee domain score estimates. However, obtaining these
estimates requires the use of a small computer to carry out the
somewhat complicated calculations.


Classical  Model  II  Estimate 

One of the first attempts to produce an estimate of an examinee's true
score using the information obtained from the group to which an examinee
belongs was made by Kelley in 1927. This is the well-known regression
estimate of true score (Lord and Novick, 1968, p. 65), which is the weighted
sum of two components: one based on the examinee's observed score and
the other based on the mean of the group to which an examinee belongs.
Jackson (1972) modified this procedure for use with binary data, by
employing the Freeman-Tukey transformation, given by

$$ g_i = \frac{1}{2}\left( \sin^{-1}\sqrt{\frac{x_i}{n+1}} + \sin^{-1}\sqrt{\frac{x_i+1}{n+1}} \right). \tag{1} $$

As a result of this transformation, a domain score $\pi_i$ is transformed
onto $\gamma_i$, where

$$ \gamma_i = \sin^{-1}\sqrt{\pi_i}. \tag{2} $$

When $.15 \le \pi_i \le .85$, and the number of test items (n) is at least eight,
the distribution of $g_i$ is approximately normal with a mean approximately
equal to the transformed domain score, $\gamma_i$, and known variance

$$ v = (4n + 2)^{-1}. $$
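
The transformation can be sketched in a few lines. The code below assumes the Freeman-Tukey form in which $g_i$ is the average of $\sin^{-1}\sqrt{x_i/(n+1)}$ and $\sin^{-1}\sqrt{(x_i+1)/(n+1)}$, with sampling variance $(4n+2)^{-1}$; the function names are ours.

```python
import math


def freeman_tukey(x: int, n: int) -> float:
    """Transform a number-correct score x on an n-item test to the g metric."""
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    return 0.5 * (math.asin(math.sqrt(x / (n + 1)))
                  + math.asin(math.sqrt((x + 1) / (n + 1))))


def sampling_variance(n: int) -> float:
    """Approximate (known) variance of g for an n-item test."""
    return 1.0 / (4 * n + 2)


if __name__ == "__main__":
    n_items = 10
    for score in (3, 5, 7):
        print(score, round(freeman_tukey(score, n_items), 4))
    print("variance:", round(sampling_variance(n_items), 4))
```

A useful check on any implementation is the symmetry of the transformation: the values of g for scores x and n - x sum to $\pi/2$ radians.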

The classical model II estimate becomes, in terms of $\gamma_i$,

$$ \tilde{\gamma}_i = \frac{g_i \hat{\phi} + (4n+2)^{-1} g_{.}}{\hat{\phi} + (4n+2)^{-1}}, \tag{3} $$

where $g_{.}$, the sample mean based on a sample of N examinees, is given by

$$ g_{.} = N^{-1} \sum_{i=1}^{N} g_i, \tag{4} $$

and $\hat{\phi}$, the sample variance of the $\gamma$'s, is given by

$$ \hat{\phi} = (N-1)^{-1} \sum_{i=1}^{N} (g_i - g_{.})^2 - (4n+2)^{-1}. \tag{5} $$

Once $\tilde{\gamma}_i$ is obtained, the domain score estimate $\tilde{\pi}_i$ is
determined from the expression

$$ \tilde{\pi}_i = (1 + .5/n)\sin^2 \tilde{\gamma}_i - .25/n. \tag{6} $$

For a detailed discussion of this estimate, the reader is referred
to Novick and Jackson (1974, p. 352) and Novick, Lewis, & Jackson (1973).
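
The full sequence of steps for the classical model II estimate can be sketched as follows. The code is a minimal illustration under our reading of the procedure: each score is transformed with the Freeman-Tukey transformation, the between-examinee variance is estimated as the sample variance of the transformed scores less $(4n+2)^{-1}$, each transformed score is pulled toward the group mean, and the result is converted back with $(1 + .5/n)\sin^2\gamma - .25/n$. The function name is ours, and the function simply refuses to proceed when the variance estimate is not positive.

```python
import math
from typing import List, Sequence


def classical_model_ii(scores: Sequence[int], n: int) -> List[float]:
    """Regression-type domain score estimates for a group of examinees."""
    if len(scores) < 2:
        raise ValueError("at least two examinees are required")
    v = 1.0 / (4 * n + 2)  # known sampling variance on the g metric
    g = [0.5 * (math.asin(math.sqrt(x / (n + 1)))
                + math.asin(math.sqrt((x + 1) / (n + 1)))) for x in scores]
    g_bar = sum(g) / len(g)
    # Variance of the transformed domain scores: observed variance minus v.
    phi_hat = sum((gi - g_bar) ** 2 for gi in g) / (len(g) - 1) - v
    if phi_hat <= 0.0:
        raise ValueError("variance estimate is not positive; no solution")
    # Weighted sum of each examinee's own g and the group mean.
    gammas = [(gi * phi_hat + v * g_bar) / (phi_hat + v) for gi in g]
    # Back-transform to the proportion (domain score) metric.
    return [(1.0 + 0.5 / n) * math.sin(gm) ** 2 - 0.25 / n for gm in gammas]


if __name__ == "__main__":
    group_scores = [4, 6, 8, 10, 12, 14, 16]  # hypothetical 20-item test
    for x, est in zip(group_scores, classical_model_ii(group_scores, 20)):
        print(x, round(x / 20, 2), round(est, 3))
```

In the printed output the estimates for extreme scores lie closer to the group mean than the corresponding proportion-correct scores do, which is the intended effect of using group information.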

Bayesian Model II Estimate

The classical model II estimate given above may not be ideal since it does not
take into account any prior information that may be available. In
addition, it may happen that $\hat{\phi}$ estimated using Equation (5) is negative, in
which case the solution will not be meaningful. Novick, et al. (1973), utilizing
the transformations given by Equations (1) and (2), obtained a Bayesian
solution for the estimation of domain score that takes into
account not only the direct and collateral information, but also any prior
information that may be available. Direct information is provided by an examinee's
test score; collateral information is contained in the test performance of
other examinees; prior information on an examinee may come from past test
performance or the examinee's performance on other objectives measured by
the test. In addition, the Bayesian model II estimation procedure avoids the
problem of negative estimates for $\phi$.

Since the distribution of $g_i$ has known variance $v$ but unknown mean
$\gamma_i$, the distribution of $g_i$ is customarily expressed as a conditional
distribution, i.e.,

$$ (g_i \mid \gamma_i) \sim N(\gamma_i, v), \tag{7} $$

where $N(\gamma_i, v)$ represents the normal distribution with mean $\gamma_i$ and
variance $v$. The Bayesian estimates are based on the revised belief
about the parameters after the data are obtained. This revised belief
is summarized in the form of the posterior distribution of the parameters.

In order to obtain the posterior distribution of $\gamma_i$, it is
necessary to specify the prior knowledge about the distribution of the $\gamma_i$,
or $f(\gamma_1, \gamma_2, \ldots, \gamma_N)$. In order to do this, it is assumed that the
transformed domain scores $\gamma_1, \gamma_2, \ldots, \gamma_N$ are exchangeable.
This amounts to saying that the prior belief about one $\gamma_i$ is no
different from the belief about any other $\gamma_j$ and implies the assumption
that the $\gamma_i$ are a random sample from some distribution. In particular, it is
assumed that the prior distribution of $\gamma_i$ is normal with unknown mean $\mu$
and unknown variance $\phi$. Thus, the specification of the prior
distribution of $\gamma_i$ is dependent upon the knowledge of the mean $\mu$ and the variance
$\phi$. However, Novick, et al. (1973) have suggested that the prior belief
about $\mu$ may not be as important as the specification of the prior belief
about $\phi$ and may be represented by a uniform distribution. The above
authors have further assumed that it is reasonable to represent the
belief about $\phi$ by an inverse chi-square distribution with $\nu$ degrees
of freedom and scale parameter $\lambda$ (see Novick and Jackson, 1974, for
an extensive discussion of this distribution). Specification of the
prior belief about $\phi$ thus requires the specification of only the two
parameters, $\nu$ and $\lambda$.

Novick, et al. (1973) have considered in detail the problem of
setting values of the parameters, $\nu$ and $\lambda$. Based on various
considerations, these authors recommend setting $\nu = 8$. The mean, $\bar{\phi}$, of the
inverse chi-square distribution is given by $\lambda / (\nu - 2)$, and once $\nu$ is
known, $\lambda$ can be set equal to $(\nu - 2)\bar{\phi}$. To estimate $\bar{\phi}$ it is necessary
to indicate the amount of information that is available about $\pi$. This
is accomplished by specifying a value M, where M is considered to be
the $\pi$ value of the typical examinee in the sample. The next step is
to specify the number of test items, n, that would have to be
administered to the examinee in order to obtain as much information
about $\pi$ as is deemed to be available. Now, transformed estimates of
$\pi$ from an n-item test are distributed normally on the $\gamma$-metric with
variance $(4n + 2)^{-1}$. Hence, $(4n + 2)^{-1}$ can be taken as an estimate
of $\bar{\phi}$ and subsequently $\lambda$ can be specified.
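
The arithmetic for setting the scale parameter is short enough to sketch directly. The code assumes the relationships stated above: the prior mean of the variance is taken as $(4n+2)^{-1}$ for the stated number of items, and the scale parameter equals that mean multiplied by the degrees of freedom less two. The function and argument names are ours, and the item count in the example is hypothetical.

```python
def inverse_chi_square_scale(nu: float, prior_items: int) -> float:
    """Scale parameter of the inverse chi-square prior on the variance.

    nu          -- degrees of freedom (8 is the recommended value)
    prior_items -- number of items judged to carry as much information
                   as is available a priori (an illustrative input)
    """
    if nu <= 2:
        raise ValueError("nu must exceed 2 for the prior mean to exist")
    prior_mean_variance = 1.0 / (4 * prior_items + 2)
    return (nu - 2) * prior_mean_variance


if __name__ == "__main__":
    # With nu = 8 and prior information worth 10 items: 6 / 42.
    print(inverse_chi_square_scale(8, 10))
```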

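Specification of $\nu$ and $\lambda$ in essence determines the prior
distribution $f(\gamma_1, \gamma_2, \ldots, \gamma_N)$.
Novick, et al. (1973) obtained the joint posterior distribution of the
parameters, and hence the joint modal estimate of the $\gamma_i$.
The joint modal estimate $\tilde{\gamma}_i$ is obtained by solving the equation

$$ \tilde{\gamma}_i = \frac{ g_i \left[ \dfrac{\lambda + \sum_{j=1}^{N} (\tilde{\gamma}_j - \tilde{\gamma}_{.})^2}{N + \nu - 1} \right] + \tilde{\gamma}_{.}\,(4n+2)^{-1} }{ \left[ \dfrac{\lambda + \sum_{j=1}^{N} (\tilde{\gamma}_j - \tilde{\gamma}_{.})^2}{N + \nu - 1} \right] + (4n+2)^{-1} }, \tag{8} $$

where

$$ \tilde{\gamma}_{.} = N^{-1} \sum_{j=1}^{N} \tilde{\gamma}_j. \tag{9} $$

This equation for $\tilde{\gamma}_i$ has to be solved iteratively, and has been found
(Novick, et al., 1973) to yield a satisfactory solution after only a
few iterations.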

Marginal  Mean  Estimate 

The Bayesian model II estimate discussed above is useful for
making joint decisions about a set of N examinees. However, in
criterion-referenced testing situations, a separate decision for each
examinee has to be made and hence separate or marginal estimates
of domain scores are required.

Lewis, Wang, and Novick (1973) obtained marginal mean
estimates of domain scores. They are given by the expression

$$ \tilde{\gamma}_i = g_{.} + \rho^{*}(g_i - g_{.}). \tag{10} $$

The quantity $\rho^{*}$ is dependent on the parameters $\nu$ and $\lambda$ and on the
data; once the parameters are set, $\rho^{*}$ can be read directly from
tables prepared by Wang (1973). Again, once $\tilde{\gamma}_i$ is obtained, the
domain score estimate is determined using Equation (6).


The marginal mean estimate given above is based on the assumption
that no prior information is available on $\mu$, i.e., the prior
distribution of $\mu$ is uniform. More recently, Lewis, Wang, and Novick
(1975) relaxed this assumption by assuming that $\mu$ is normally
distributed with mean $\theta$ and variance $\phi/n$. In this case, they showed that
the estimate $\tilde{\gamma}_i$ is given by

$$ \tilde{\gamma}_i = \rho g_i + (T - \rho)\, g_{.} + (1 - T)\,\theta. \tag{11} $$

Since the definitions of $\rho$ and $T$ are rather involved, we refer
the interested reader to Lewis, et al. (1975) for a discussion of
these quantities and for the procedure required to specify the
additional parameters, $\theta$ and n.

"Quasi" Bayesian Estimates

In obtaining the joint modal estimates and the marginal mean
estimates, Novick, et al. (1973) and Lewis, et al. (1973) assumed
that the prior beliefs about $\mu$ and $\phi$ could be expressed in the form
of distributions. There are several variations to this theme. If,
instead of specifying the prior beliefs in the form of distributions,
values for $\mu$ and $\phi$ can be specified on the basis of previous
experience, then the expressions corresponding to the Bayesian marginal
mean estimates are readily obtained. These estimates are relatively
easy to compute.

These estimates are based on the prior specification of $\mu$ and $\phi$.
Specification of $\mu$ introduces relatively few complications, but the
exact specification of $\phi$ poses a problem. This is not a quantity
most practitioners are familiar with. However, the interrogation
procedure described by Novick and Jackson (1974) can be effectively used
to yield this information. Two sets of assumptions are considered in deriving these
quasi-Bayesian estimates: (1) the prior belief about $\mu$ can be expressed
as a uniform distribution, and $\phi$ can be specified exactly, and
(2) both $\mu$ and $\phi$ can be specified exactly. In the first case, it
can be shown that the marginal mean estimate $\tilde{\gamma}_i$ is given by

$$ \tilde{\gamma}_i = \frac{g_i \phi + (4n+2)^{-1} g_{.}}{\phi + (4n+2)^{-1}}. \tag{12a} $$

In the second case, the marginal mean estimate, $\tilde{\gamma}_i$, becomes

$$ \tilde{\gamma}_i = \frac{g_i \phi + (4n+2)^{-1} \mu}{\phi + (4n+2)^{-1}}. \tag{12b} $$

The similarity between the marginal mean estimates (12a) and (12b) and the
classical model II estimate given by Equation (3) is obvious. In fact, it is
interesting to note that the classical model II estimate is in reality an
empirical Bayes estimate obtained by using sample estimates for $\mu$ and $\phi$.
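
Both quasi-Bayesian estimates share one computational form, which the following sketch makes explicit. It assumes the estimate is a weighted average of the examinee's transformed score and a centre value, with weights proportional to the specified variance of the transformed domain scores and to $(4n+2)^{-1}$ respectively; the centre is the sample mean of the transformed scores in the first case and the specified prior mean in the second. Function and argument names are ours.

```python
def quasi_bayes_gamma(g_i: float, phi: float, n: int, centre: float) -> float:
    """Weighted average of an examinee's g and a centre value.

    phi    -- specified variance of the transformed domain scores
    n      -- number of items measuring the objective
    centre -- sample mean of g (first case) or specified prior mean
              (second case)
    """
    if phi <= 0.0:
        raise ValueError("phi must be positive")
    v = 1.0 / (4 * n + 2)  # sampling variance of g
    return (g_i * phi + v * centre) / (phi + v)


if __name__ == "__main__":
    # Hypothetical values: g = 1.0, group mean 0.8, phi = 0.04, 10 items.
    print(round(quasi_bayes_gamma(1.0, 0.04, 10, 0.8), 4))
```

The smaller the specified variance relative to $(4n+2)^{-1}$, the more strongly the estimate is pulled toward the centre value; with a long test the examinee's own score dominates.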

Hambleton, Hutten, and Swaminathan (1976) investigated the comparative
efficiencies of the various estimates given by Equations (3), (8),
(10), (12a), and (12b), and the proportion-correct score estimate via a simulation
study (see section 8.5). Factors under consideration in their study
were sample size, test length, homogeneity of the domain score
distribution, specification of prior information and cut-off score (or
performance standard). Their conclusions indicated that when precise
information is available, i.e., when $\phi$ and $\mu$ can be specified, the
marginal mean estimate, $\tilde{\gamma}_i$, given by Equation (12b), had the smallest
absolute error as defined by the expression

$$ e = \sum_{i=1}^{N} |\tilde{\gamma}_i - \gamma_i|. $$

When $\mu$ cannot be specified exactly, the estimate given by
Equation (12a) produced the next best result in terms of minimizing $e$.
The other estimates, ranging from third best to poorest, were: the marginal
mean estimate given by Equation (10), the classical model II estimate given by
Equation (3), the joint modal estimate given by Equation (8), and the
proportion-correct score estimate. However, in most cases $\mu$ and $\phi$ cannot be
specified exactly, and hence, the results of this study bear out the
expectation that Bayesian estimation procedures are the most efficient in
the estimation of domain scores. Also, it should be pointed out that these
authors did not study the estimate given by Equation (11). We can,
nevertheless, conclude that the estimate given by Equation (11) would be at
least as accurate as that given by Equation (10) if the assumption of a
normal prior on $\mu$ is valid.



8.4   Mastery  State  Determinations 

In this section, the basic situation involved when a
criterion-referenced test score is being used to allocate an individual to a
mastery state is introduced. Then, as in section 8.3, some advanced
decision models and a Bayesian procedure for making examinee
assignments to mastery states are discussed. The material is presented in
section 8.4.2.

Before discussing advanced procedures for allocating examinees
to mastery states, it will be useful to review a topic first
encountered in Unit 4: types of errors (false positive
and false negative) made when classifying individuals into
mastery states. The following two-fold table of losses associated
with decisions can be constructed:


                        Domain Scores

Decision        $\pi \ge \pi_o$        $\pi < \pi_o$

Advance               0                  a

Retain                b                  0


where  $\pi$ = the examinee's domain score,

a = loss associated with advancing a student whose domain
    score $\pi$ is $< \pi_o$ (false positive error),

b = loss associated with retaining a student whose domain
    score $\pi$ is $\ge \pi_o$ (false negative error).

The values of a and b are specified by the test constructor. One possible
decision is to let a = b, and this might be done, according to Novick and
Lewis (1974), when

. . . it were no more serious to advance a student
whose level was below the criterion than to retain
a student who was above. . . .

From the specification above, a general decision rule can be
generated. The rule is to advance (assign to a "mastery state") an
examinee if

$$ b\,[\mathrm{Prob}(\pi \ge \pi_o \mid \mathrm{data})] > a\,[\mathrm{Prob}(\pi < \pi_o \mid \mathrm{data})] $$

and retain (i.e., assign to a "non-mastery state"), otherwise. An
equivalent comparison is to compare the loss ratio $a/b$ to the ratio

$$ \frac{\mathrm{Prob}(\pi \ge \pi_o \mid \mathrm{data})}{\mathrm{Prob}(\pi < \pi_o \mid \mathrm{data})}. $$

If $a/b$ is less than the above ratio, the examinee should be advanced. If
$a/b$ is greater than the above ratio, the examinee should be retained.
$\mathrm{Prob}(\pi \ge \pi_o \mid \mathrm{data})$ is the probability of an examinee having a domain score
equal to or above the cut-off score. The probability is obtained as a
part of a Bayesian analysis of the examinee's test performance.
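
The decision rule can be sketched as a short function. It assumes that the probability of the examinee's domain score being at or above the cut-off has already been obtained from a Bayesian analysis and is simply passed in; the function and argument names are ours.

```python
def mastery_decision(p_master: float,
                     loss_false_positive: float,
                     loss_false_negative: float) -> str:
    """Return "advance" or "retain" by comparing expected losses.

    p_master            -- Prob(domain score >= cut-off | data)
    loss_false_positive -- loss a: advancing a true non-master
    loss_false_negative -- loss b: retaining a true master
    """
    if not 0.0 <= p_master <= 1.0:
        raise ValueError("p_master must lie in [0, 1]")
    expected_loss_if_retained = loss_false_negative * p_master
    expected_loss_if_advanced = loss_false_positive * (1.0 - p_master)
    if expected_loss_if_retained > expected_loss_if_advanced:
        return "advance"
    return "retain"


if __name__ == "__main__":
    # Equal losses: advance whenever mastery is more likely than not.
    print(mastery_decision(0.6, 1.0, 1.0))
    # False positives three times as serious: stronger evidence is needed.
    print(mastery_decision(0.6, 3.0, 1.0))
```

With a = b the rule reduces to advancing whenever the probability of mastery exceeds one half; making a larger than b raises the probability required before an examinee is advanced.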




8.4.1  Introduction

A second major use of scores obtained from criterion-referenced
tests is to assign examinees to mastery states. In view of the
discussion in section 8.3, it may appear tempting to first estimate
an examinee's domain score, compare it to one or more cut-off scores
defined on the domain score scale, and then, for example, in the case of
two mastery states, classify the examinee as either a master or a
non-master. Typically, this is the strategy adopted by individuals
implementing objectives-based instructional programs. Unfortunately,
this approach is not usually very satisfactory. One reason is that users
must assume all classification errors (whether they be of the
"false-positive" or "false-negative" type) to be equally serious (i.e., a = b).
This is an unreasonable assumption to make in many instructional settings.
For example, with instructional objectives that are prerequisites to more
advanced ones in a curriculum, false-positive errors (moving examinees
ahead before they are ready) may be far more serious than false-negative
errors (holding examinees back, even though they may have "mastered" the
objectives in question). (One possible solution is to raise a cut-off
score when false-positive errors are more serious than false-negative
errors. When the importance of the errors is reversed, a cut-off score
can be lowered.) Also, domain score estimates may be obtained using a
loss function that is completely inappropriate for making
decisions. In assigning an examinee to a mastery state, an error can
occur when an examinee is assigned, based upon his/her test score (or
a variation of it), to a mastery state other than his/her true mastery
state. For example, the individual may truly have mastered the material,
but based upon his/her score on the test, be assigned non-mastery status.


Here, the notion of error used in domain score estimation (squared error
loss) makes no sense. Distance or difference, in this case from the
relevant cut-off score, is not a concern; rather the concern is simply whether the
examinee located either above or below the cut-off score is correctly
assigned to the proper mastery state. Thus, the appropriate loss function
in this decision-theoretic process would be a threshold loss function.
On the other hand, Livingston (1972, 1975), and Linden and
Mellenbergh (1977) have investigated both linear and non-linear loss
functions. Here, one assumes the loss incurred by the misclassification of an
examinee with a domain score far from the cut-off score is far more serious than the
loss incurred when an examinee with a domain score close to a cut-off
score is misclassified.

The test practitioner who lacks the facilities to use the
somewhat complex methods that follow
should determine mastery status by comparing an individual's
proportion-correct score to the cut-off score. However, such a procedure suffers
the same problems discussed in section 8.3.1. If the number of items
measuring an objective is small, then the proportion-correct score will
often give an unreliable estimate for determining mastery status. Also,
and perhaps more of a problem, when using this procedure, all
classification errors must be assumed to be equal. That is why the procedures
to be discussed in section 8.4.2 are so valuable; they incorporate
additional data into the estimates, thereby decreasing the error, and they
allow for the consideration of different classification errors.

Throughout this introduction, we have been implicitly assuming the
existence of only two mastery states, master and non-master. However,
this need not be the case. For
instance, there may be two cut-offs, $\pi_r$ and $\pi_a$, such that if the
student's score is below $\pi_r$, he/she is retained. If his/her score is
above $\pi_a$, he/she is advanced, and finally, if the score is between
the two cut-off scores, the individual may be "held" for a short
review. In the development that follows, the procedures are first
formulated for one cut-off point (two mastery states), and then extended
to k mastery states.
The problem of classifying an examinee into one of two
categories using a threshold loss function has been studied extensively
by Hambleton and Novick (1973), and Swaminathan, Hambleton, and
Algina (1975). As in section 8.3.2, the observed scores $x_i$
are transformed into $g_i$ by the arc sine transformation. Let
$\pi$ and $\pi_o$ denote the domain score and cut-off score respectively, and
let $\gamma$ $(= \sin^{-1}\sqrt{\pi})$ denote the transformed domain score
and $\gamma_o$ $(= \sin^{-1}\sqrt{\pi_o})$ the transformed cut-off score. Then, examinees
with transformed domain scores $\gamma$ less than $\gamma_o$ are classified as
non-masters and as masters otherwise. Conforming with the notation
employed by Hambleton and Novick (1973), the two-valued parameter $\omega$
is used to denote the mastery state of an examinee. The
parameter $\omega$ assumes one of two values, $\omega_1$ or $\omega_2$. If the examinee is a
non-master, i.e., if $\gamma < \gamma_o$, then

$$ \omega = \omega_1, $$

while if the examinee is a master, i.e., if $\gamma \ge \gamma_o$, then

$$ \omega = \omega_2. $$


In classifying an examinee the decision-maker may take one of two actions,
for example, retain the examinee for instruction or advance the
examinee to the next segment of instruction. The action "retain"
will be denoted by $a_1$ and the action "advance" by $a_2$. The
decision-maker can commit one of two kinds of errors. If the examinee is
in reality a non-master (in state $\omega_1$), the decision-maker can
classify the examinee as a master (in state $\omega_2$), or if in reality the
examinee is a master (in state $\omega_2$), the decision-maker can classify
the examinee as a non-master (in state $\omega_1$). In order to arrive at
a rule for selecting actions $a_1$ or $a_2$ it is necessary to specify the
losses associated with these two kinds of classification errors.

Swaminathan, et al. (1975) introduced the quantity
$L(\omega_i, a_j)$ to denote the non-negative loss function
describing the loss incurred when action $a_j$ is taken for an examinee
who is in state $\omega_i$. Thus, in the two category decision problem,

$$ L(\omega_1, a_2) = \ell_{12} $$

and

$$ L(\omega_2, a_1) = \ell_{21}, $$

with $\ell_{12}, \ell_{21} > 0$.

These authors have suggested that the action for
which the expected loss

$$ E_{\omega} L(\omega, a) $$

is a minimum should be chosen as the appropriate action.




Swaminathan, et al. (1975) extended the two-category problem to one where
examinees are classified into one of several categories. Suppose there are
k categories into which the examinees are to be classified and consequently
k actions to be taken. For example, when k = 3, the decision-maker may be
interested in classifying examinees as masters, partial masters, and
non-masters. The appropriate actions may be to advance the masters, retain
the partial masters for a brief review, and retain the non-masters for
remedial work.

We need k-1 cut-off scores to separate examinees into k categories or k
states, $\omega_1, \omega_2, \ldots, \omega_k$. Denote the cut-off scores by
$\pi_{o1}, \pi_{o2}, \ldots, \pi_{o,k-1}$. An examinee is in state
$\omega_1$ when her domain score $\pi$ is less than $\pi_{o1}$, in state
$\omega_2$ if $\pi$ is between $\pi_{o1}$ and $\pi_{o2}$, and so on. In
general an examinee is in state $\omega_i$ if
$\pi_{o,i-1} \leq \pi < \pi_{oi}$.
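The interval rule just stated can be sketched in a few lines of Python. This is only an illustration of the definition: the function name `mastery_state` and the cut-offs .60 and .80 are assumptions made for the example, and in practice the domain score is unobservable, so the rule is applied to posterior probabilities rather than to a known score.

```python
from bisect import bisect_right


def mastery_state(domain_score, cutoffs):
    """Return the 1-based index i of the state omega_i.

    `cutoffs` holds the k-1 increasing cut-off scores. State i means
    cutoffs[i-2] <= domain_score < cutoffs[i-1], with 0 and 1 acting as
    the outer bounds of the proportion-correct scale.
    """
    # bisect_right counts how many cut-offs are <= domain_score.
    return bisect_right(cutoffs, domain_score) + 1


cutoffs = [0.60, 0.80]  # hypothetical cut-offs for k = 3 categories
states = [mastery_state(p, cutoffs) for p in (0.45, 0.60, 0.79, 0.80, 0.95)]
print(states)
```

A score exactly equal to a cut-off falls in the higher state, which matches the "less than or equal" on the lower bound of each interval.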

Associated with misclassifications is the loss function $L(\omega_i, a_j)$.
If an action $a_j$ is taken for an examinee who in reality is in state
$\omega_i$, the loss is $\ell_{ij}$ so that

$$L(\omega_i, a_j) = \ell_{ij}.$$

As before, the action which has the smallest expected loss is chosen.


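For action $a_j$, the expected loss is given by

$$E_\omega L(\omega, a_j) = \sum_{p=1}^{k} \ell_{pj}\,
  \text{Prob}[\gamma_{o,p-1} \leq \gamma < \gamma_{op} \mid \text{Data}] \tag{13}$$

where $\gamma_{o0} = -\infty$ and $\gamma_{ok} = +\infty$. Action $a_j$ is
chosen if

$$\sum_{p=1}^{k} \ell_{pj}\,
  \text{Prob}[\gamma_{o,p-1} \leq \gamma < \gamma_{op} \mid \text{Data}]
  < \sum_{p=1}^{k} \ell_{pn}\,
  \text{Prob}[\gamma_{o,p-1} \leq \gamma < \gamma_{op} \mid \text{Data}],
  \quad (n = 1, 2, \ldots, k;\ n \neq j). \tag{14}$$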



Once the posterior distribution of $\gamma$ is determined, the above
probabilities are determined as the area under the probability density curve
between $\gamma_{o,p-1}$ and $\gamma_{op}$, $p = 1, 2, \ldots, k$.
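As a rough sketch of the "area under the curve" computation, the snippet below assumes the posterior of the transformed score is normal (the approximation the text goes on to describe) and uses only the Python standard library. The function names and the numbers passed in (a posterior mean of 1.043, a standard deviation of .115, and cut-offs of .60 and .80) are illustrative assumptions.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def state_probabilities(post_mean, post_sd, cutoffs_pi):
    """Posterior probability of each of the k states.

    `cutoffs_pi` are the k-1 cut-offs on the proportion-correct scale;
    each is moved to the transformed scale with arcsin(sqrt(pi)).
    """
    gammas = [math.asin(math.sqrt(c)) for c in cutoffs_pi]
    # Cumulative probabilities at -infinity, each cut-off, and +infinity.
    cdf = [0.0] + [normal_cdf(g, post_mean, post_sd) for g in gammas] + [1.0]
    # Area between consecutive cut-offs.
    return [cdf[p + 1] - cdf[p] for p in range(len(cdf) - 1)]


probs = state_probabilities(1.043, 0.115, [0.60, 0.80])
print([round(p, 3) for p in probs])
```

The k areas always sum to one, so only k-1 of them need to be looked up when working from printed normal tables.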

The next stage in the decision-theoretic process is to obtain the posterior
distribution of the parameter $\gamma$ for each examinee. Several procedures
are available for the determination of posterior distributions and, hence,
posterior probabilities. The first method is that given by Lewis, et al.
(1973). Utilizing the distributions and assumptions given in connection with
the Bayesian Model II estimates in a previous section, Lewis, et al. (1973)
derived an approximation to the posterior distribution. They showed that the
posterior distribution of $\gamma_i$ is approximately normal, i.e.,

$$(\gamma_i \mid \text{Data}) \sim N(\mu_i, \sigma_i^2) \tag{15}$$

where

$$\mu_i = g_{\cdot} + \rho^*(g_i - g_{\cdot}) \tag{16}$$

and

$$\sigma_i^2 = \frac{1}{4n + 2} \cdot \frac{(m - 1)\rho^* + 1}{m}
  + (g_i - g_{\cdot})^2\, \sigma^{*2}. \tag{17}$$

(This approximation is reasonably good when the number of test items exceeds
seven.) The quantity $g_{\cdot}$ is defined by Equation (A). The quantities
$\rho^*$ and $\sigma^{*2}$ in expressions (16) and (17) are dependent on the
parameters $\nu$ and $\lambda$ of the inverse chi-square distribution of
$\phi$, and have to be computed by numerical integration.

The tables prepared by Wang (1973) can be used by specifying $\nu$ and
$\lambda$ to obtain $\rho^*$ and $\sigma^{*2}$.

Returning to the problem of classification of examinees into k categories,
Swaminathan et al. (1975) first transform the (k-1) specified cut-off scores
$\pi_{op}$ into $\gamma_{op}$, given by

$$\gamma_{op} = \sin^{-1}\sqrt{\pi_{op}}, \quad p = 1, \ldots, k-1. \tag{18}$$

Next the probabilities of the type given by Equations (13) and (14) are
calculated. For any examinee,

$$\text{Prob}[\pi_{o,p-1} \leq \pi < \pi_{op} \mid \text{Data}]
  = \text{Prob}[\gamma_{o,p-1} \leq \gamma < \gamma_{op} \mid \text{Data}]. \tag{19}$$

For the ith examinee, the quantity $z_{opi}$ is defined as

$$z_{opi} = \frac{\gamma_{op} - \mu_i}{\sigma_i} \tag{20}$$

with $\mu_i$ and $\sigma_i^2$ defined by Equations (16) and (17). The
quantity $z_{opi}$ is the normal deviate corresponding to the cut-off score
$\gamma_{op}$ for examinee i. Since the posterior distribution is
approximately normal with mean $\mu_i$ and variance $\sigma_i^2$,

$$\text{Prob}[\gamma_{o,p-1} \leq \gamma_i < \gamma_{op} \mid \text{Data}]
  \approx \text{Prob}[z_{o,p-1,i} \leq z_i < z_{opi} \mid \text{Data}]. \tag{21}$$




That is, the probability that $\gamma_i$ is between $\gamma_{o,p-1}$ and
$\gamma_{op}$ is approximately equal to the probability that a standardized
normal variate is between the z scores $z_{o,p-1,i}$ and $z_{opi}$. Hence,
for each examinee i, the quantity

$$E_\omega L(\omega, a_j) = \sum_{p=1}^{k} \ell_{pj}\,
  \text{Prob}[z_{o,p-1,i} \leq z_i < z_{opi} \mid \text{Data}] \tag{22}$$

is calculated for each action j (j = 1, 2, ..., k). These k expected losses
are then compared with one another, and the action for which the expected
loss is the least is chosen as the appropriate action. An illustration of
the procedure is offered in the next section.




*8.4.2 A Bayesian Decision-Theoretic Procedure
The paper by Swaminathan, Hambleton, and Algina (1975) describes
one method for using Bayesian decision-theoretic procedures to allocate
examinees to mastery states.




JOURNAL OF EDUCATIONAL MEASUREMENT
VOLUME 12, NO. 2 SUMMER 1975


A BAYESIAN DECISION-THEORETIC PROCEDURE FOR
USE WITH CRITERION-REFERENCED TESTS¹

H. SWAMINATHAN, RONALD K. HAMBLETON, and JAMES ALGINA

University of Massachusetts

In a previous paper, Hambleton and Novick (1973) conceptualized a decision-
theoretic formulation for several issues in criterion-referenced measurement.
Among the issues discussed was the important problem of allocating
individuals to mastery states. These authors proposed a solution to the
problem based on a Bayesian procedure given by Novick, Lewis, and Jackson
(1973). More recently, Lewis, Wang, and Novick (1973) have developed a
Bayesian procedure that is more appropriate in the context of criterion-
referenced measurement. Based on this most recent method, we present in this
paper an exposition of a decision-theoretic solution to the problem of
allocating individuals to mastery states on the objectives included in a
criterion-referenced test.

Allocation of Individuals to Mastery States
The primary problem in criterion-referenced measurement is that of
classifying an examinee into one of several mutually exclusive mastery states
or categories. One might think of mastery states, defined for an objective,
as representing different levels of functioning on the domain of items
measuring that objective. It makes sense to assume that each examinee has a
true mastery state on each objective covered in a criterion-referenced test.
Typically, a cut-off score or threshold score is set to permit the
decision-maker to assign examinees, on the basis of their performance on each
subset of items measuring an objective covered in a criterion-referenced
test, into one of two mutually exclusive categories: masters and non-masters.
(See Millman, 1973, for a good discussion of guidelines for setting cutting
scores.) Since all the items in the domain of items measuring an objective
cannot usually be administered to the examinee, a small number of items is
sampled. Thus the problem of classifying the examinees into categories has to
be considered within a statistical framework.

An obvious approach to the allocation problem is to compare an examinee's
observed score to the threshold score and make the appropriate mastery
decision. However, as criterion-referenced tests are typically short, we
would be making decisions on the basis of very limited amounts of
information. Our decision-theoretic approach to the allocation problem allows
the decision-maker to build into the decision process prior and collateral
information about the examinee's true mastery state. This approach is not
unlike that of using a regression line to estimate true scores, and provides
a way of obtaining more information on each examinee without requiring the
administration of additional test items, a great advantage indeed when one
considers the amount of time typically allotted for criterion-referenced
testing in objectives-based programs (see, for example, Glaser & Nitko, 1971,
pp. 625-670; Hambleton, 1974; Hambleton & Novick, 1973). However, even after
incorporating this additional information, our knowledge of an examinee's
true mastery state will be probabilistic and misclassifications will be
likely to occur. The decision-theoretic approach also allows the
decision-maker to incorporate into the decision process the costs of
misclassifications.

¹The authors are grateful to Ming-Mei Wang, Douglas Coulson, and Jason
Millman for helpful comments on earlier drafts of the paper.

Classification of Examinees Into One of Two Categories

We shall first consider the problem of classifying an examinee into one of
two categories and later generalize the procedure to include several
categories.

Let $\gamma$ denote the "true" score of an examinee. We will see later that
$\gamma$ is related in a very simple way to $\pi$, the true
proportion-correct score. The quantity $\pi$ is defined as the proportion of
items, in the domain of items measuring the objective, that an examinee can
correctly answer. If $\gamma_o$ is the predetermined threshold or cut-off
score, examinees with true scores $\gamma$ less than $\gamma_o$ are
classified as true non-masters and true masters otherwise. In keeping with
the notation employed by Hambleton and Novick (1973) let the two-valued
parameter $\omega$ denote the mastery state of the examinee. The parameter
$\omega$ assumes one of two values, $\omega_1$ or $\omega_2$. If the examinee
is a non-master, i.e., if $\gamma < \gamma_o$, we set

$$\omega = \omega_1,$$

and if he is a master, i.e., $\gamma \geq \gamma_o$, we set

$$\omega = \omega_2.$$

Both $\gamma$ and $\omega$ are, of course, unobservable quantities. Our
approach is to produce, using Bayesian statistical methods, a distribution
representing our belief about the location of the parameter $\gamma$. Using
this distribution, known as the posterior distribution on the true score
parameter, $\gamma$, and with a cutting score defined, we can produce
probabilities representing the chances of an examinee being located in each
mastery state.

In classifying the examinees, the decision-maker may take one of two actions:
retain the examinee for instruction or advance the examinee to the next
segment of instruction. The action "retain" will be denoted by $a_1$ and the
action "advance" by $a_2$. The decision-maker can commit one of two kinds of
errors. If the individual is in reality a non-master (in state $\omega_1$),
the decision-maker can classify the individual as a master (in state
$\omega_2$), or if in reality the individual is a master (in state
$\omega_2$), the decision-maker can classify the individual as a non-master
(in state $\omega_1$). In order to arrive at a rule for selecting actions
$a_1$ or $a_2$, it is necessary to specify the losses associated with these
two kinds of misclassifications.

Consistent with the usage and notation of decision theory, we shall employ
the notation $L(\omega_i, a_j)$ to denote the non-negative loss function
which describes the loss incurred when action $a_j$ is taken for the
individual who is in state $\omega_i$. Thus,

$$L(\omega_1, a_2) = \ell_{12},$$

and

$$L(\omega_2, a_1) = \ell_{21}.$$

Of course,

$$L(\omega_1, a_1) = L(\omega_2, a_2) = 0.$$

A good classification procedure is obviously one which minimizes in some
sense or other the total loss incurred. That is, we shall choose that action
for which the expected loss,

$$E_\omega L(\omega, a),$$

is a minimum.
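We see that if action $a_1$ is taken, then the expected loss,
$E_\omega L(\omega, a_1)$, is given by

$$E_\omega L(\omega, a_1)
  = 0 \cdot \text{Prob}[\omega = \omega_1] + \ell_{21}\,\text{Prob}[\omega = \omega_2]
  = \ell_{21}\,\text{Prob}[\gamma \geq \gamma_o]. \tag{1a}$$

Similarly, if action $a_2$ is taken, then the expected loss,
$E_\omega L(\omega, a_2)$, is given by

$$E_\omega L(\omega, a_2)
  = \ell_{12}\,\text{Prob}[\omega = \omega_1] + 0 \cdot \text{Prob}[\omega = \omega_2]
  = \ell_{12}\,\text{Prob}[\gamma < \gamma_o]. \tag{1b}$$

We take action $a_1$ if

$$E_\omega L(\omega, a_1) < E_\omega L(\omega, a_2),$$

or equivalently if

$$\ell_{21}\,\text{Prob}[\gamma \geq \gamma_o] < \ell_{12}\,\text{Prob}[\gamma < \gamma_o]. \tag{2a}$$

Similarly, we take action $a_2$ if

$$\ell_{12}\,\text{Prob}[\gamma < \gamma_o] < \ell_{21}\,\text{Prob}[\gamma \geq \gamma_o]. \tag{2b}$$

If it so happened that

$$\ell_{12}\,\text{Prob}[\gamma < \gamma_o] = \ell_{21}\,\text{Prob}[\gamma \geq \gamma_o],$$

we would be indifferent as to which action to take.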
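The two inequalities can also be read as a single probability threshold, which may be easier to apply in practice. The short rearrangement below is a sketch, not part of the original derivation; it writes $P$ for $\text{Prob}[\gamma \geq \gamma_o]$, so that $\text{Prob}[\gamma < \gamma_o] = 1 - P$, and assumes $\ell_{12} + \ell_{21} > 0$.

```latex
\ell_{12}(1 - P) < \ell_{21} P
\iff \ell_{12} < (\ell_{12} + \ell_{21})\, P
\iff P > \frac{\ell_{12}}{\ell_{12} + \ell_{21}}
```

Under these assumptions the examinee is advanced (action $a_2$) exactly when the probability of mastery exceeds $\ell_{12}/(\ell_{12} + \ell_{21})$. With equal losses the threshold is 1/2; if, say, $\ell_{12} = 1$ and $\ell_{21} = 2$, it drops to 1/3, so a costlier false-negative error makes the rule more willing to advance.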

In order to clarify the meaning of $\text{Prob}[\gamma < \gamma_o]$ and
$\text{Prob}[\gamma \geq \gamma_o]$, and hence the expected losses given by
(2a) and (2b), we have to distinguish between prior probabilities and
posterior probabilities. In simplistic terms, prior probabilities are based
on our beliefs about the parameter $\gamma$ before any test data are
obtained. For example, we often have information about the ability levels of
the students in a program in the form of school records, their past
performance, etc. This information, conveniently summarized in the form of
the prior distribution $f(\gamma)$ of the parameter $\gamma$, reflects our
prior belief before new test data are obtained. The posterior probabilities,
on the other hand, are based on our revised belief about the parameter
$\gamma$ after the test data are obtained. And this belief is summarized by
the posterior distribution of the parameter $\gamma$, denoted by, say,
$h(\gamma \mid \text{Data})$. In the language of statistics,
$h(\gamma \mid \text{Data})$, the posterior distribution of $\gamma$, is the
conditional distribution of $\gamma$, given the data. The area under the
curve $h(\gamma \mid \text{Data})$ below $\gamma_o$ gives the probability
$\text{Prob}[\gamma < \gamma_o \mid \text{Data}]$, and the area above
$\gamma_o$ gives the probability
$\text{Prob}[\gamma \geq \gamma_o \mid \text{Data}]$.

Unfortunately, the posterior distribution for each examinee is not obtainable
directly. The first stage in obtaining this posterior marginal distribution
is to obtain the joint posterior distribution of all the m examinees,
$h(\gamma_1, \gamma_2, \ldots, \gamma_m \mid \text{Data})$.

As a consequence of Bayes' Theorem, the posterior joint distribution is
readily expressed in terms of the joint prior distribution
$f(\gamma_1, \gamma_2, \ldots, \gamma_m)$ as

$$h(\gamma_1, \gamma_2, \ldots, \gamma_m \mid \text{Data}) \propto
  g(\text{Data} \mid \gamma_1, \gamma_2, \ldots, \gamma_m)\,
  f(\gamma_1, \gamma_2, \ldots, \gamma_m). \tag{3}$$




The expression $g(\text{Data} \mid \gamma_1, \gamma_2, \ldots, \gamma_m)$ is
known as the likelihood function and is a statement of the joint probability
of observing the data conditioned upon the unknown parameters
$\gamma_1, \gamma_2, \ldots, \gamma_m$. We shall return to the discussion of
obtaining the posterior marginal distribution from the joint posterior
distribution in the next section.

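The probabilities in expressions (2a) and (2b) are, in actuality, posterior
probabilities and hence should be so denoted. Thus, we take action $a_1$ if

$$\ell_{21}\,\text{Prob}[\gamma \geq \gamma_o \mid \text{Data}]
  < \ell_{12}\,\text{Prob}[\gamma < \gamma_o \mid \text{Data}], \tag{4a}$$

and take action $a_2$ if

$$\ell_{12}\,\text{Prob}[\gamma < \gamma_o \mid \text{Data}]
  < \ell_{21}\,\text{Prob}[\gamma \geq \gamma_o \mid \text{Data}]. \tag{4b}$$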

Description of the Bayesian Decision-Theoretic Procedure

We begin by assuming that the ith examinee is administered a random sample of
n items measuring a particular objective. An examinee has a true
proportion-correct score, $\pi$, defined over the domain of items measuring
the objective. Although it would be possible to obtain an estimate of $\pi$
on the basis of the examinee's performance on the sample of test items, this
is not our primary aim. If we consider testing within a decision-making
framework, then to make decisions concerning an examinee's mastery state, we
need the posterior probabilities of the kind described by Equations (4a) and
(4b).

Since it is mathematically inconvenient to work with $\pi$, we shall,
following Novick et al. (1973), utilize the transformation

$$\gamma = \sin^{-1}\sqrt{\pi} \tag{5}$$

and obtain the posterior distribution of $\gamma$ instead of $\pi$. To be
compatible with this transformation all of our observed test scores will need
to be transformed to the $\gamma$-metric. This is easily accomplished by
transforming the test score $x_i$ of the ith examinee into

$$g_i = \sin^{-1}\left[(x_i + 3/8)/(n + 3/4)\right]^{1/2}. \tag{6}$$

This particular transformation, which has been discussed in some detail by
Novick et al. (1973), is attractive because, for examinee i, the distribution
of $g_i$ is approximately normal with mean $\gamma_i = \sin^{-1}\sqrt{\pi_i}$
and constant variance $V = (4n + 2)^{-1}$. The approximation is reasonably
good when $\pi$ lies between .15 and .85 and n is at least 8. Since the
distribution of $g_i$ has known variance but unknown mean $\gamma_i$, the
distribution of $g_i$ is customarily expressed as a conditional distribution,
i.e.,

$$g_i \mid \gamma_i \sim N(\gamma_i, V), \tag{7}$$

where $N(\gamma_i, V)$ represents the normal distribution with mean
$\gamma_i$ and variance V. Referring to Equation (3) we can see that in order
to obtain the posterior distribution for each $\gamma_i$, we need the
likelihood function
$g(\text{Data} \mid \gamma_1, \gamma_2, \ldots, \gamma_m)$. The product of
the m distributions given by Equation (7), where m is the number of examinees
in the sample, yields the likelihood function.
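The transformations in Equations (5) and (6) and the variance V are simple to compute. The sketch below uses only the standard library; the function names, the test length n = 10, and the cut-off of .80 are illustrative assumptions rather than anything prescribed by the procedure.

```python
import math


def transform_score(x, n):
    """Equation (6): transformed observed score for x correct out of n."""
    return math.asin(math.sqrt((x + 0.375) / (n + 0.75)))


def transform_proportion(pi):
    """Equation (5): transformed proportion-correct score."""
    return math.asin(math.sqrt(pi))


def sampling_variance(n):
    """Approximate constant variance V = 1 / (4n + 2) of a transformed score."""
    return 1.0 / (4 * n + 2)


n = 10  # hypothetical test length
g_values = {x: round(transform_score(x, n), 3) for x in range(4, 11)}
gamma_0 = transform_proportion(0.80)  # hypothetical cut-off of .80
V = sampling_variance(n)
print(g_values, round(gamma_0, 3), round(V, 4))
```

Results are in radians, so the transformed scale runs from 0 to about 1.571 rather than from 0 to 1.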

In order to obtain the posterior distribution of $\gamma_i$ we have to
specify our prior knowledge about the distribution of $\gamma_i$. We assume
that the transformed "true" scores, $\gamma_1, \gamma_2, \ldots, \gamma_m$,
of the m individuals are exchangeable. This amounts to saying that our prior
belief about one $\gamma_i$ is no different than our belief about any other
$\gamma_j$ and implies the assumption that each $\gamma_i$ is randomly
sampled from some distribution. In particular, we assume that the prior
distribution of $\gamma_i$ is normal with unknown mean $\alpha$ and unknown
variance $\phi$. Thus, the specification of the prior distribution of
$\gamma_i$ is dependent upon our knowledge of the mean $\alpha$ and the
variance $\phi$. However, Novick et al. (1973) suggest that our prior belief
about $\alpha$ may not be as important as our prior belief about $\phi$. The
above authors have assumed that it is reasonable to represent our belief
about $\phi$ by an inverse chi-square distribution with $\nu$ degrees of
freedom and scale parameter $\lambda$ (see Novick & Jackson, 1974, for an
extensive discussion of this distribution). Specification of the prior belief
about $\phi$ thus requires the specification of only the two parameters,
$\nu$ and $\lambda$.

Novick et al. (1973) have considered in detail the problems of setting values
of the parameters, $\nu$ and $\lambda$. Based on theoretical considerations,
these authors recommend setting $\nu = 8$. The mean $\tilde{\phi}$ of the
inverse chi-square distribution is given by $\lambda/(\nu - 2)$, and once
$\nu$ is known, $\lambda$ can be set equal to $(\nu - 2)\tilde{\phi}$. To
estimate $\tilde{\phi}$ we are required to indicate the amount of information
we have about $\pi$. This is accomplished by specifying a value M, where M is
considered to be the $\pi$ value of the typical examinee in the sample. We
then specify the number of test items, t, we would need to administer to the
examinee in order to obtain as much information about $\pi$ as we feel we now
have. Transformed estimates of $\pi$ from a t-item test are distributed
normally on the $\gamma$-metric with variance $(4t + 2)^{-1}$. Hence, we
could take $(4t + 2)^{-1}$ as our estimate of $\tilde{\phi}$ and subsequently
specify $\lambda$.
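The arithmetic for the prior parameters is short enough to show directly. The sketch below follows the relations just given; the function name and the choice t = 5 are illustrative assumptions, and the default of eight degrees of freedom is the value recommended in the text.

```python
def prior_parameters(t, nu=8):
    """Return (prior mean of phi, scale parameter lambda).

    t  : length of the hypothetical test judged to carry as much
         information as is currently held about an examinee
    nu : degrees of freedom of the inverse chi-square prior
    """
    phi_mean = 1.0 / (4 * t + 2)   # variance of a transformed t-item score
    lam = (nu - 2) * phi_mean      # from mean = lambda / (nu - 2)
    return phi_mean, lam


phi_mean, lam = prior_parameters(t=5)
print(round(phi_mean, 4), round(lam, 4))
```

A larger t expresses more confidence in the prior information and so gives a smaller prior variance.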

Specification of $\nu$ and $\lambda$ in essence determines the prior
distribution $f(\gamma)$ of $\gamma_1, \gamma_2, \ldots, \gamma_m$.
Substituting this in Equation (3), Novick et al. (1973) obtained the joint
posterior distribution of the parameters. This joint posterior distribution
of $\gamma_1, \gamma_2, \ldots, \gamma_m$ is useful for making joint
decisions about the m individuals. However, in criterion-referenced testing
situations we are interested in making separate decisions about each
individual and hence we require the distribution of each $\gamma_i$, i.e.,
the marginal distribution of $\gamma_i$.

It has been shown by Lewis et al. (1973) that the posterior marginal
distribution of $\gamma_i$, our belief about the location of the ith
examinee's score on the $\gamma$-metric, is approximately normal, i.e.,

$$\gamma_i \mid \text{Data} \sim N(\mu_i, \sigma_i^2) \tag{8}$$

where

$$\mu_i = g_{\cdot} + \rho^*(g_i - g_{\cdot}), \tag{9}$$

and

$$\sigma_i^2 = \frac{1}{4n + 2} \cdot \frac{(m - 1)\rho^* + 1}{m}
  + (g_i - g_{\cdot})^2\, \sigma^{*2}. \tag{10}$$

(This approximation is reasonably good when the number of test items exceeds
seven.) The quantity $g_{\cdot}$ is defined as

$$g_{\cdot} = m^{-1} \sum_{i=1}^{m} g_i.$$

The quantities $\rho^*$ and $\sigma^{*2}$ in Equations (9) and (10) are
dependent on the parameters $\nu$ and $\lambda$ of the inverse chi-square
distribution of $\phi$, and have to be computed by numerical integration.
Wang (1973) has prepared a set of tables so that on specifying $\nu$ and
$\lambda$, $\rho^*$ and $\sigma^{*2}$ may be obtained.
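Once the two tabled constants are in hand, the posterior mean and variance for every examinee follow from Equations (9) and (10). The sketch below is an illustration under stated assumptions: the variance expression is the form of Equation (10) as reconstructed from a poorly legible source, the function name is hypothetical, and the four transformed scores are invented. The two constants passed in are the ones quoted in the worked example; they are reused here purely to show the mechanics, since the tabled values depend on the actual prior and data.

```python
def posterior_moments(g, n, rho_star, sigma_star_sq):
    """Return a list of (posterior mean, posterior variance) pairs.

    g             : transformed observed scores, one per examinee
    n             : number of test items
    rho_star      : tabled weight applied to the individual's deviation
    sigma_star_sq : tabled constant multiplying the squared deviation
    """
    m = len(g)
    g_bar = sum(g) / m
    # Part of the variance shared by all examinees (assumed form of Eq. 10).
    base = ((m - 1) * rho_star + 1) / (m * (4 * n + 2))
    moments = []
    for g_i in g:
        mean = g_bar + rho_star * (g_i - g_bar)         # Equation (9)
        var = base + (g_i - g_bar) ** 2 * sigma_star_sq
        moments.append((mean, var))
    return moments


scores = [0.785, 0.976, 1.081, 1.205]  # hypothetical transformed scores
moments = posterior_moments(scores, n=10, rho_star=0.5335, sigma_star_sq=0.0159)
for g_i, (mean, var) in zip(scores, moments):
    print(g_i, round(mean, 3), round(var ** 0.5, 3))
```

Each posterior mean is pulled from the examinee's own transformed score toward the group mean, and the posterior variance grows as the score moves away from that mean.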

Returning to the problem of classification of students into masters and
non-masters, we first transform the specified cut-off score $\pi_o$ into
$\gamma_o$, given by

$$\gamma_o = \sin^{-1}\sqrt{\pi_o}. \tag{11}$$

Now we have to calculate the probabilities necessary for comparisons of the
type given by Equations (4a) and (4b). For any individual it should be clear
that

$$\text{Prob}[\pi_i \geq \pi_o \mid \text{Data}]
  = \text{Prob}[\gamma_i \geq \gamma_o \mid \text{Data}].$$

For each individual we define the quantity $z_{oi}$ as

$$z_{oi} = \frac{\gamma_o - \mu_i}{\sigma_i}, \tag{12}$$

with $\mu_i$ and $\sigma_i^2$ defined by Equations (9) and (10). Since the
posterior distribution is approximately normal with mean $\mu_i$ and variance
$\sigma_i^2$,

$$\text{Prob}[\gamma_i \geq \gamma_o \mid \text{Data}]
  \approx \text{Prob}[z \geq z_{oi} \mid \text{Data}].$$

That is, the probability that $\gamma_i$ is greater than $\gamma_o$ is
approximately equal to the probability that a standardized normal variate is
greater than the z score, $z_{oi}$. Hence

$$\ell_{12}\,\text{Prob}[z < z_{oi} \mid \text{Data}]$$

can be compared with

$$\ell_{21}\,\text{Prob}[z \geq z_{oi} \mid \text{Data}]$$

and the appropriate decision made.

For convenience we shall summarize the procedure by outlining the steps taken
to arrive at the appropriate action and illustrate the procedure with a
hypothetical example.² The steps are:

1. Transform the number correct, $x_i$, for the ith examinee into $g_i$,
   given by

   $$g_i = \sin^{-1}\left[(x_i + 3/8)/(n + 3/4)\right]^{1/2}.$$

2. Specify the cut-off score $\pi_o$ and obtain the corresponding $\gamma_o$,
   given by

   $$\gamma_o = \sin^{-1}\sqrt{\pi_o}.$$

3. Specify the prior distribution of $\phi$ by specifying the parameters
   $\nu$ and $\lambda$.

4. Obtain $\rho^*$ and $\sigma^{*2}$ as tabulated by Wang (1973) and hence
   determine the mean $\mu_i$ and variance $\sigma_i^2$ of the posterior
   distribution of $\gamma_i$ given by Equations (9) and (10).

5. Obtain the standardized normal deviate

   $$z_{oi} = (\gamma_o - \mu_i)/\sigma_i$$

   and hence determine the probability,
   $\text{Prob}[z \geq z_{oi} \mid \text{Data}]$, which is approximately
   equal to $\text{Prob}[\pi_i \geq \pi_o \mid \text{Data}]$.

6. Make the decision according to Equations (4a) or (4b).

²For another description of the steps we refer the reader to the excellent
document on criterion-referenced measurement prepared by Millman (1974,
pp. 311-397).
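Steps 5 and 6 can be sketched as follows, taking the posterior mean and standard deviation from step 4 as given. The function names are hypothetical, and the inputs (a posterior mean of 1.043, a standard deviation of .115, a cut-off of .80, and losses of one and two units) are illustrative. The source treats an exact tie as a matter of indifference; the sketch arbitrarily advances in that case.

```python
import math


def prob_master(gamma_0, mu_i, sigma_i):
    """Step 5: approximate posterior probability that gamma_i >= gamma_0."""
    z = (gamma_0 - mu_i) / sigma_i
    # Upper-tail area of the standard normal distribution at z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def decide(p_master, loss_false_advance, loss_false_retain):
    """Step 6: pick the action with the smaller expected loss.

    loss_false_advance : loss for advancing a true non-master
    loss_false_retain  : loss for retaining a true master
    """
    exp_loss_retain = loss_false_retain * p_master
    exp_loss_advance = loss_false_advance * (1.0 - p_master)
    return "retain" if exp_loss_retain < exp_loss_advance else "advance"


gamma_0 = math.asin(math.sqrt(0.80))
p = prob_master(gamma_0, mu_i=1.043, sigma_i=0.115)
decision = decide(p, loss_false_advance=1, loss_false_retain=2)
print(round(p, 3), decision)
```

Changing the two losses while holding the probability fixed is a quick way to see how sensitive a borderline decision is to the loss structure.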

We will illustrate the above procedure by the following hypothetical example.
Our data and results are summarized in Tables 1 and 2. Suppose that a set of
10 items representative of the domain of items measuring an objective is
administered to a group of 25 examinees, and that the cut-off score $\pi_o$
is set at .80. First, we transform the observed scores, $x_i$, into $g_i$,
and the cut-off score $\pi_o$ into $\gamma_o$. Next, we must specify our
prior beliefs about $\phi$. As indicated earlier, we do this by choosing
$\nu$ and $\lambda$, the parameters of the distribution that we use to
represent our belief about $\phi$. In order to determine $\nu$ and
$\lambda$, we must decide the length of the test that would be required to
give us as much information as we feel we now have about any examinee's true
mastery score. Suppose that, in our example, we decided that a five-item test
would be required. We therefore take t = 5 and, hence,
$(4t + 2)^{-1} = .0455$, as our value for $\tilde{\phi}$. Since, in general,
a good value for $\nu$ is eight, the value for $\lambda$ is .2730, because
$\lambda = (\nu - 2)\tilde{\phi}$. Using the tables prepared by Wang (1973),
we find $\rho^* = .5335$ and $\sigma^{*2} = .0159$. We now have enough
information to compute $\mu_i$ and $\sigma_i$ using Equations (9) and (10).
Finally, we obtain the standardized normal deviate given by Equation (12) and
using the tables of the standardized normal distribution find the
approximate probability, $\text{Prob}[\pi_i \geq .8 \mid \text{Data}]$, and
its complement, $\text{Prob}[\pi_i < .8 \mid \text{Data}]$. Suppose also that
the loss associated with a false-positive error, $\ell_{12}$, is taken to be
one "unit" and the loss associated with a false-negative error, $\ell_{21}$,
to be two "units." In order to make a decision about each examinee we weight
the appropriate probability by the associated loss and obtain the expected
loss for each action. Thus, taking the losses into account, examinees with
nine or ten correct items are advanced, while examinees with fewer than nine
correct items are retained for instruction. By manipulating the various
losses in the example it is easy to see how other decisions may be made.


Table 1

Bayesian Analysis of a Hypothetical Set of Data: n = 10, m = 25

Number of                      Transformed       Marginal    Marginal Standard
Items Correct                  Observed Score    Mean        Deviation
$x_i$          Frequency       $g_i$             $\mu_i$     $\sigma_i$

 4             [illegible]      .695              .836        .121
 5             [illegible]      .785              .881        .118
 6             [illegible]      .875              .933        .116
 7             [illegible]      .980              .989        .115
 8             [illegible]     1.083             1.043        .115
 9             [illegible]     1.202             1.107        .118
10             [illegible]     1.392             1.211        .125


Table 2

Decision-Making in the Two-Category Classification Problem: n = 10, m = 25

                                                   Expected Loss
Number of                                    Action $a_1$    Action $a_2$
Items Correct  Prob[$\pi_i$<.8  Prob[$\pi_i\geq$.8  (Retain)        (Advance)
$x_i$          |Data]           |Data]              $\ell_{21}$Prob[$\pi_i\geq$.8|Data]  $\ell_{12}$Prob[$\pi_i$<.8|Data]  Action

 4             .988             .012                 .025            .988            retain
 5             .972             .028                 .056            .972            retain
 6             .931             .069                 .139            .931            retain
 7             .849             .151                 .302            .849            retain
 8             .710             .290                 .579            .710            retain
 9             .502             .498                 .996            .502            advance
10             .231             .770                1.539            .231            advance

$\gamma_o = \sin^{-1}\sqrt{.80} = 1.107$

Classification of Examinees Into One of k Categories

Suppose that there are k categories into which the examinees are to be
classified, and consequently, k actions to be taken. For example, when
k = 3, the decision-maker may be interested in classifying examinees as
masters, partial masters, or non-masters. The appropriate actions may be to
advance the masters, retain the partial masters for a brief review and send
the non-masters for remedial work.

In order to separate examinees into k categories or k states,
$\omega_1, \omega_2, \ldots, \omega_k$, we need k-1 cut-off scores. Denote
these by $\pi_{o1}, \pi_{o2}, \ldots, \pi_{o,k-1}$. Hence, an examinee is in
state $\omega_1$ if his true proportion-correct score $\pi$ is less than
$\pi_{o1}$, in state $\omega_2$ if his score $\pi$ is between $\pi_{o1}$ and
$\pi_{o2}$, and so on. In general an examinee is in state $\omega_i$ if
$\pi_{o,i-1} \leq \pi < \pi_{oi}$. In addition, we denote the set of k
actions by $a_1, a_2, \ldots, a_j, \ldots, a_k$. Action $a_i$ is to be taken
if the examinee is classified into state $\omega_i$.

Associated with misclassifications is the loss function $L(\omega_i, a_j)$.
If an action $a_j$ is taken for an individual who in reality is in state
$\omega_i$, the loss is $\ell_{ij}$ so that

$$L(\omega_i, a_j) = \ell_{ij}.$$

These losses are conveniently displayed in Table 3. As before, we choose the
action which has the smallest expected loss. Here again we utilize the
transformation presented in Equation (5).


Table 3
Loss Table for a Multi-Action Problem

                                                 Action
State                                    $a_1$        $a_2$        ...  $a_j$        ...  $a_k$

$\omega_1$ ($\gamma < \gamma_{o1}$)      0            $\ell_{12}$  ...  $\ell_{1j}$  ...  $\ell_{1k}$
$\omega_2$                               $\ell_{21}$  0            ...  $\ell_{2j}$  ...  $\ell_{2k}$
...                                      ...          ...          ...  ...          ...  ...
$\omega_i$                               $\ell_{i1}$  $\ell_{i2}$  ...  $\ell_{ij}$  ...  $\ell_{ik}$
...                                      ...          ...          ...  ...          ...  ...
$\omega_k$ ($\gamma_{o,k-1} \leq \gamma$)  $\ell_{k1}$  $\ell_{k2}$  ...  $\ell_{kj}$  ...  0


For action $a_j$ the expected loss is given by

$$E_\omega L(\omega, a_j) = \sum_{i=1}^{k} \ell_{ij}\,
  \text{Prob}[\gamma_{o,i-1} \leq \gamma < \gamma_{oi} \mid \text{Data}],$$

where $\gamma_{o0} = -\infty$ and $\gamma_{ok} = +\infty$. Thus action $a_j$
is chosen if

$$\sum_{i=1}^{k} \ell_{ij}\,
  \text{Prob}[\gamma_{o,i-1} \leq \gamma < \gamma_{oi} \mid \text{Data}]
  < \sum_{i=1}^{k} \ell_{ip}\,
  \text{Prob}[\gamma_{o,i-1} \leq \gamma < \gamma_{oi} \mid \text{Data}]
  \quad (p = 1, 2, \ldots, k;\ p \neq j).$$

The probability
$\text{Prob}[\gamma_{o,i-1} \leq \gamma < \gamma_{oi} \mid \text{Data}]$ is
calculated in the manner described in the last section.

In order to illustrate the procedure in the multiple action problem, we utilize the hypothetical data given in Table 1. Suppose that the losses associated with wrongly classifying an examinee into one of three categories, masters, partial masters, and non-masters, are as reported in Table 4.

Assuming that the cutting scores, $\pi_{o1}$ and $\pi_{o2}$, are .60 and .80 respectively, and working with the posterior distribution of $\gamma$ for each examinee in exactly the same manner as in the previous example, it is possible to calculate the probability of each examinee being in any of the three mastery states. The hypothetical probabilities reported in Table 5 are the probabilities associated with an examinee being in any of




SWAMINATHAN, HAMBLETON, AND ALGINA

Table 4
Hypothetical Losses for the Three-Action Problem

                                 Action
State             $a_1$ (Remedial Work)   $a_2$   $a_3$
Non-Master        0                       2       3
Partial Master    1                       0       2
Master            2                       1       0

these three categories. These probabilities, when combined with the loss structure presented in Table 4, would result in examinees with six or fewer correct items being retained for remedial work, examinees with seven, eight, or nine correct items being retained for a brief review, and examinees with ten items correct being moved ahead.

Table 5
Decision-Making in the Three-Category Classification Problem (n = 10, N = 23)

Number of                                                                     Expected Loss
Items       Prob($\pi$<.6|Data)  Prob(.6<$\pi$<.8|Data)  Prob($\pi$>.8|Data)  Action $a_1$  Action $a_2$  Action $a_3$  Action
Correct
 4          .451                 [illegible]             .012                 .553          1.330         2.433         retain
 5          .311                 .456                    .02?                 .512          1.040         2.440         retain
 6          [illegible]          [illegible]             .009                 .724          .739          2.207         retain
 7          .124                 .445                    .151                 .947          .519          1.112         retain briefly
 8          .017                 .423                    .210                 1.203         .444          1.507         retain briefly
 9          .031                 .471                    .490                 1.447         .540          1.033         retain briefly
10          .003                 .225                    .770                 1.743         .710          .443          advance

Note. $\sin^{-1}\sqrt{.80} = 1.107$.

Conclusion

The procedure described in this paper should be feasible with objectives-based programs that have a small computer of the type typically used to manage instruction (see, for example, Baker, 1971). We shall attempt to demonstrate the feasibility of the procedure by briefly outlining the steps a hypothetical instructional designer would


take. Let us suppose that an instructional designer is interested in making decisions on students' status with respect to a particular set of program objectives. Test items measuring each objective are organized into a criterion-referenced test and administered to the students. We assume that the items are binary-scored and represent a random sample of items from the domain of items that measure each objective. For each objective, the designer must specify the number and the location of the mastery states on the mastery score interval [0, 1]. This is done by defining the cutting scores. In addition, the instructional designer specifies the losses attached to classifying an individual incorrectly. A loss matrix of the kind shown in Table 3 is developed and provided to the computer. Some rough guidelines for developing the loss matrix have been described by Hambleton and Novick (1973). Finally, it is necessary for the designer to specify his prior beliefs about the distribution of ability on each objective covered in the test. This is one step where the instructional designer needs to be extremely careful. The effect of a poor choice of priors on the decision process is not known at this point, and it remains to be determined under what conditions a poor choice of priors will result in worse decisions than not using Bayesian methods at all. Clearly, further research is necessary to develop efficient methods for accurately assessing prior beliefs.

Using any one of a variety of input devices (e.g., optical scanning sheets, mark sense cards, or computer cards), the examinees' test item responses are read by the computer and the Bayesian decision-theoretic procedure is implemented. The computer program can be designed to provide the output necessary to monitor student progress through the instructional program. At a minimum, a statement of mastery allocations on objectives for each student can be produced, and this information can be used to guide a student through the next segment of his instruction.

The decision-theoretic procedure outlined in this paper provides a framework within which Bayesian statistical methods can be employed with criterion-referenced tests to improve the quality of decision-making in objectives-based instructional programs. The incorporation of losses introduces the decision-maker's values into the decision process. The Bayesian methods incorporate the prior knowledge of the decision-maker and utilize the data from all examinees, thereby effectively increasing the amount of information the decision-maker has without requiring the administration of additional test items. However, it should be pointed out that research is needed to establish the robustness of the Bayesian statistical model with respect to deviations of the data from the underlying assumptions. We also note that the Bayesian statistical model described in this paper is only one of several models that could be used within our decision-theoretic framework (for another, see Novick & Lewis, 1974). Further study of these additional models would seem to be highly appropriate.

REFERENCES

BAKER, F. B. Computer-based instructional management systems: A first look. Review of Educational Research, 1971, 41, 51-70.
GLASER, R., & NITKO, A. J. Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement. Washington: American Council on Education, 1971.
HAMBLETON, R. K. Testing and decision-making procedures for selected individualized instructional programs. Review of Educational Research, 1974, 44, 371-400.
HAMBLETON, R. K., & NOVICK, M. R. Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 1973, 10, 159-170.
LEWIS, C., WANG, M. M., & NOVICK, M. R. Marginal distributions for the estimation of proportions in m groups. ACT Technical Bulletin No. 13. Iowa City, Iowa: The American College Testing Program, 1973.
MILLMAN, J. Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 1973, 43, 205-216.
MILLMAN, J. Criterion-referenced measurement. In W. J. Popham (Ed.), Evaluation in education: Current applications. Berkeley, California: McCutchan Publishing Co., 1974.
NOVICK, M. R., & JACKSON, P. H. Statistical methods for educational and psychological research. New York: McGraw-Hill, 1974.
NOVICK, M. R., & LEWIS, C. Prescribing test length for criterion-referenced measurement. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in criterion-referenced measurement. Monograph Series in Evaluation, No. 3. Los Angeles: Center for the Study of Evaluation, University of California, 1974.
NOVICK, M. R., LEWIS, C., & JACKSON, P. H. The estimation of proportions in m groups. Psychometrika, 1973, 38, 19-45.
WANG, M. M. Tables of constants for the posterior marginal estimates of proportions in m groups. ACT Technical Bulletin No. 14. Iowa City, Iowa: The American College Testing Program, 1973.


AUTHORS

SWAMINATHAN, HARIHARAN. Address: School of Education, University of Massachusetts, Amherst, MA 01002. Title: Assistant Professor of Education and Psychology; Associate Director, Laboratory of Psychometric and Evaluative Research. Degrees: B.S. Dalhousie, M.Ed., M.S., Ph.D. University of Toronto. Specialization: Psychometric theory, multivariate statistics.

HAMBLETON, RONALD K. Address: School of Education, University of Massachusetts, Amherst, MA 01002. Title: Associate Professor of Education and Psychology; Director, Laboratory of Psychometric and Evaluative Research. Degrees: B.A. Waterloo, M.A., Ph.D. University of Toronto. Specialization: Psychometric theory, evaluation methodology.

ALGINA, JAMES J. Address: School of Education, University of Massachusetts, Amherst, MA 01002. Title: Research Associate, Laboratory of Psychometric and Evaluative Research. Degrees: B.A. University of Rhode Island, Ed.D. University of Massachusetts. Specialization: Psychometric theory, statistics.


8.5 Simulation Study Involving Criterion-Referenced Test Scores

In this section, we present a paper by Hambleton, Hutten, and Swaminathan (1976). The authors compared four methods for estimating student mastery; these methods were discussed in section 8.3.2.

A COMPARISON OF SEVERAL METHODS FOR
ASSESSING STUDENT MASTERY IN
OBJECTIVES-BASED INSTRUCTIONAL PROGRAMS


RONALD K. HAMBLETON
LEAH R. HUTTEN
HARI SWAMINATHAN
University of Massachusetts


ABSTRACT 

In objectives-based instructional programs where relatively short criterion-referenced tests are administered to estimate student mastery for the purpose of monitoring a student through the program, estimates which maximally utilize the information that can be obtained from the student during the allotted testing time are required. Bayesian estimates, which utilize prior information about the student, direct information provided by the student, and collateral information in the test data of other students, appear to be ideally suited for this purpose. In this paper, the relative merits of several methods, Bayesian and classical, for the estimation of student mastery are investigated. The effects of such factors as group homogeneity, test length, sample size, and prior information on the accuracy of the estimates as well as on decision-making accuracy are studied through computer simulations. It is shown that certain Bayesian estimates are superior to others, and the implications of the findings for objectives-based instructional programs are discussed.


ONE OF THE IMPORTANT PROBLEMS in objectives-based instructional programs such as Individualized Instruction (2, 3) concerns the assessment of student mastery. In order to monitor a student through an objectives-based program, pre-testing and post-testing are done to determine mastery on the specific program objectives (4). In theory at least, these tests can usually be made as reliable and valid as desired by increasing the test length. However, in practice, the total amount of testing time is limited and falls far short of the testing time needed to guarantee a high level of decision-making accuracy on the basis of test results. Needed are procedures that maximally utilize information that can be obtained from a student during the allotted time for testing. At present, there exist at least three approaches for doing this: tailored testing (1, [number illegible]), assessment of partial knowledge using new test scoring and/or test administration procedures ([number illegible]), and assessment of student mastery using Bayesian statistical procedures (5, 7, 11, 12, 14).

The possibility of using Bayesian statistical procedures for the assessment of student mastery is particularly appealing because they require absolutely no change from the usual test administration procedures. Moreover, a careful examination of the Bayesian theory suggests that some meaningful gains may be realized using Bayesian methods (11). Finally, empirical work with similar Bayesian methods in another application confirmed the efficacy of this class of Bayesian methods (10).

In the typical objectives-based program, relatively short criterion-referenced tests are used to determine student mastery. A cutoff score or threshold score is set to permit the decision-maker to assign examinees, on the basis of their performance on each subset of items measuring an objective covered in the criterion-referenced test, into one of two mutually exclusive categories: masters and non-masters. Usually the proportion-correct score for an examinee on items measuring each objective is compared to the cutting score for the purpose of decision-making. However, the method of using proportion-correct scores as estimates of mastery is not entirely satisfactory when the number of items on which the proportions are based is small and when there are many students. In situations where one is interested in estimating many mastery scores, some, by chance, will be substantially overestimated and others underestimated. The implication is that many of the decisions made on the basis of test results will be incorrect, and this reduces the overall effectiveness of the instructional program.

The Bayesian procedure, in theory at least, utilizes additional information on each student that is available but ignored by non-Bayesian procedures. According to Novick (9), this is done by using not only the direct information



JOURNAL  OF  EXPERIMENTAL  EDUCATION 




provided by a student's test score but also the collateral information contained in the test data of other students and any prior information on the student that may be available.

In view of the current interest in applying Bayesian methods to testing problems within the context of objectives-based programs, the purpose of the present investigation was to study, in a systematic way, the relative merits of several methods for estimating student mastery. In addition to the proportion-correct score, the classical Model II estimate given by Jackson (6), and Bayesian estimates such as the Bayesian joint modal estimate (12), the marginal mean estimate (7), and a modification of it were studied.

The "quality" of the various estimates is dependent, to some extent, on the interactive effects of factors such as group homogeneity, test length, sample size, and, for the Bayesian estimates, the prior information available on the ability of the examinees. For this reason, a computer simulation study was conducted where we produced comparisons of "true" mastery scores with "estimated" mastery scores obtained by the methods described above for different ability distributions, test lengths, sample sizes, and for varying amounts of prior information on the examinees.

Method 

Four  Methods  of  Estimating  Student  Mastery 

Proportion-Correct  Estimate 

The simplest and the most obvious estimate of the ith examinee's true mastery score, $\pi_i$, defined as the proportion of items in the domain of items measuring the objective that the examinee can answer correctly, is his observed proportion score, $\hat{\pi}_i$. This estimate is obtained by dividing the examinee's test score, $x_i$ (the number of items answered correctly), by the total number, $n$, of the items measuring the objective included in the test. Appealing as it may seem in view of the fact that the proportion-correct score is an unbiased estimate of the true mastery score, this estimate, as mentioned earlier, is extremely unreliable when the number of items on which the estimate is based is small. For this reason, procedures that take into account other available information in order to produce improved estimates, especially in the case when there are few items measuring an objective in the test, would be more desirable.

Classical Model II Estimate (Jackson)

One of the first attempts to produce an estimate of the true score of an examinee using the information obtained from the group to which an individual belongs was made by Kelley in 1927. This is the well-known regression estimate of true score (8:65), which is the weighted sum of two components: one based on the examinee's observed score and the other based on the mean of the group to which the examinee belongs. Jackson (6) modified this procedure for use with binary data by transforming the test score $x_i$ into $g_i$ via the arcsine transformation, known as the Freeman-Tukey transformation,

$$g_i = \frac{1}{2}\left[\sin^{-1}\sqrt{\frac{x_i}{n+1}} + \sin^{-1}\sqrt{\frac{x_i+1}{n+1}}\right]. \qquad (1)$$

As a result of this transformation, the true mastery score $\pi_i$ is transformed into $\gamma_i$, where

$$\gamma_i = \sin^{-1}\sqrt{\pi_i}. \qquad (2)$$

If $\pi_i$ is not too large or too small, and if $n$, the number of test items, is sufficiently large, then the distribution of $g_i$ is approximately normal with a mean approximately equal to the transformed true mastery score, $\gamma_i$, and known variance

$$v = (4n + 2)^{-1}.$$

The Model II estimate, or the modified Kelley estimate, becomes, in terms of $\gamma$,

$$\hat{\gamma}_i = \frac{\hat{\phi}_\gamma\, g_i + (4n + 2)^{-1}\, g_{\cdot}}{\hat{\phi}_\gamma + (4n + 2)^{-1}}, \qquad (3)$$

where $g_{\cdot}$, the sample mean based on a sample of $N$ examinees, is given by

$$g_{\cdot} = N^{-1} \sum_{i=1}^{N} g_i, \qquad (4)$$

and $\hat{\phi}_\gamma$, the sample variance of the $\gamma$'s, is given by

$$\hat{\phi}_\gamma = (N - 1)^{-1} \sum_{i=1}^{N} (g_i - g_{\cdot})^2 - (4n + 2)^{-1}. \qquad (5)$$

Once $\hat{\gamma}_i$ is obtained, $\hat{\pi}_i$ is determined from the expression

$$\hat{\pi}_i = (1 + .5/n)\sin^2\hat{\gamma}_i - .25/n. \qquad (6)$$

For a detailed discussion of this estimate, the reader is referred to Novick, Lewis, and Jackson (12).
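
The steps of the Model II estimate can be followed in a short Python sketch. It is offered only as an illustration of Equations 1 through 6 as they are described in the text; the function names are hypothetical, and raising an error when the variance estimate of Equation 5 is negative is a choice made for this sketch, not something the paper prescribes.

```python
import math


def freeman_tukey(x, n):
    """Equation 1: arcsine transform of a number-correct score x on n items."""
    return 0.5 * (
        math.asin(math.sqrt(x / (n + 1)))
        + math.asin(math.sqrt((x + 1) / (n + 1)))
    )


def jackson_estimates(scores, n):
    """Model II (modified Kelley) estimates of true mastery scores.

    scores: number-correct scores, one per examinee (at least two).
    n: number of items measuring the objective.
    """
    v = 1.0 / (4 * n + 2)                      # known sampling variance
    g = [freeman_tukey(x, n) for x in scores]
    g_bar = sum(g) / len(g)                    # Equation 4
    # Equation 5: sample variance of g minus the sampling variance.
    phi = sum((gi - g_bar) ** 2 for gi in g) / (len(g) - 1) - v
    if phi < 0:
        # A negative variance estimate has no meaningful interpretation.
        raise ValueError("estimated variance of true scores is negative")
    estimates = []
    for gi in g:
        gamma = (phi * gi + v * g_bar) / (phi + v)        # Equation 3
        pi_hat = (1 + 0.5 / n) * math.sin(gamma) ** 2 - 0.25 / n  # Equation 6
        estimates.append(pi_hat)
    return estimates
```

For example, `jackson_estimates([2, 5, 8, 10, 6], 10)` pulls each examinee's estimate toward the group mean: the examinee with 2 of 10 correct receives an estimate above .20, and the examinee with 10 of 10 correct receives an estimate below 1.00.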

Bayesian Joint Modal Estimate

The Jackson estimate given above is not the ideal estimate since it does not take into account any prior information that may be available. In addition, it may happen that $\phi_\gamma$ estimated using Equation 5 is negative, in which case the solution will not be meaningful. Novick et al. (12), utilizing the transformations in Equations 1 and 2, obtained a Bayesian solution for the estimation of the mastery score that not only takes into account the direct and collateral information but also any prior information that may be available. In addition, this procedure avoids the problem of negative estimates for $\phi_\gamma$.



HAMBLETON,  HUTTEN,  AND  SWAMINATHAN 




The Bayesian solution is more complicated than the classical Model II solution and involves an iterative procedure. The Bayesian procedure assumes that, in addition to the assumption that $g_i$ is normally distributed with mean $\gamma_i$ and variance $(4n + 2)^{-1}$, the transformed true mastery scores $\gamma_1, \gamma_2, \ldots, \gamma_N$ of the $N$ individuals come from a normal population with unknown mean $a$ and unknown variance $\phi_\gamma$. Thus, to use the Bayesian procedure, the prior knowledge about $a$ and $\phi_\gamma$ has to be specified.

However, Novick et al. (12) suggest that prior knowledge about $a$ may not be as important as the specification of prior beliefs about $\phi_\gamma$. Furthermore, they have suggested that it is reasonable to represent the prior belief about $\phi_\gamma$ by an inverse chi-square distribution, which depends on only two parameters, $\nu$ and $\lambda$ (12). Thus, in order to indicate one's prior belief one has to specify $\nu$ and $\lambda$ (12). (For details of this procedure, see 12, 14.)

Marginal  Mean  Estimate 

The Bayesian Model II estimate discussed above is in reality the joint modal estimate. This joint estimate is useful for making joint decisions about a set of $N$ examinees. However, in criterion-referenced testing situations, separate decisions about each individual have to be made, and, hence, separate or marginal estimates of true mastery scores are required.

Lewis, Wang, and Novick (7) have obtained a marginal mean estimate of the true mastery score, given by

$$\tilde{\gamma}_i = g_{\cdot} + \rho^{*}(g_i - g_{\cdot}). \qquad (7)$$

The quantity $\rho^{*}$ is dependent on the parameters $\nu$ and $\lambda$ and on the data; once the parameters are set, $\rho^{*}$ can be read directly from tables prepared by Wang (15). Again, once $\tilde{\gamma}_i$ is obtained, $\tilde{\pi}_i$ is determined using Equation 6.

In obtaining the joint modal estimates and the marginal mean estimates, the above authors assumed that the prior beliefs about $a$ and $\phi_\gamma$ could be expressed in the form of distributions. In the present investigation, it was felt that the effects on the marginal mean estimates of specifying the prior beliefs about $a$ and $\phi_\gamma$ as point values should also be studied. To this end, we obtained marginal mean estimates based on the assumptions that (1) the prior belief about $a$ can be expressed as a uniform distribution, but that $\phi_\gamma$ can be specified exactly; and (2) both $a$ and $\phi_\gamma$ can be specified exactly. In the first case, it can be shown that the marginal mean estimate $\tilde{\gamma}_i$ is given by

$$\tilde{\gamma}_i = \frac{\phi_\gamma\, g_i + (4n + 2)^{-1}\, g_{\cdot}}{\phi_\gamma + (4n + 2)^{-1}}. \qquad (8)$$

In the second case, the marginal mean estimate, $\tilde{\gamma}_i$, becomes

$$\tilde{\gamma}_i = \frac{\phi_\gamma\, g_i + (4n + 2)^{-1}\, a}{\phi_\gamma + (4n + 2)^{-1}}. \qquad (9)$$

The similarity between the marginal mean estimates given by Equations 8 and 9 and the Jackson estimate given by Equation 3 is obvious.
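
The two point-prior cases can be sketched as follows. This sketch assumes the standard precision-weighted form for a normal model with known variances, which mirrors the structure of Equation 3; the function names are hypothetical and the code is an illustration rather than the authors' program.

```python
def marginal_mean_known_variance(g_i, g_bar, phi, n):
    """Case 1: uniform prior on the mean a, variance phi specified exactly.

    The transformed score g_i is shrunk toward the sample mean g_bar.
    """
    v = 1.0 / (4 * n + 2)          # sampling variance of g_i
    return (phi * g_i + v * g_bar) / (phi + v)


def marginal_mean_known_mean_and_variance(g_i, a, phi, n):
    """Case 2: both the mean a and the variance phi specified exactly.

    The transformed score g_i is shrunk toward the specified mean a.
    """
    v = 1.0 / (4 * n + 2)
    return (phi * g_i + v * a) / (phi + v)
```

In both cases a large specified variance leaves the transformed score almost unchanged, while a small specified variance pulls it strongly toward the group mean or the specified mean.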

Factors  under  Consideration 




Sample  Size 

Samples of examinees of size 15, 25, and 50 were considered. These values were selected because they seemed to be typical of the class sizes that might be expected to occur in practice. (Selected results with sample sizes larger than 50 were obtained, but they were essentially the same as the results obtained with smaller sample sizes.)

Test  Length 

Tests of length 8, 10, and 20 items were employed in the simulations. Novick et al. (12) recommend the use of their methods when the test includes at least eight items. Twenty items represents a reasonable upper limit on the number of items to be used to measure the mastery of an objective covered in a criterion-referenced test.

Homogeneity  of  the  True  Mastery  Score  Distribution 

Two fairly typical distributions of true mastery scores were considered. To obtain a homogeneous distribution it was assumed that the true mastery scores were distributed in the following way: 20% of the population were distributed uniformly on each of the five intervals defined by the end points .46 and 1.00 and the four middle boundary points .70, .79, .85, and .91. These values were selected to correspond roughly to a beta distribution with mean .80 and variance .145. The beta distribution rather than the normal distribution was chosen because the beta distribution is defined on the same interval [0, 1] as the true mastery scores, whereas the normal distribution is not.

To obtain a heterogeneous distribution, we assumed that 20% of the population were distributed uniformly in each of the five intervals defined by the end points .30 and 1.00 and the four middle boundary points .60, .70, .80, and .90.
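
A minimal sketch of drawing true mastery scores from these two piecewise-uniform distributions is given below. The names `HOMOGENEOUS`, `HETEROGENEOUS`, and `sample_true_scores` are hypothetical, the third heterogeneous boundary is read as .80, and the random number generator is Python's standard one rather than whatever was used in the original study.

```python
import random

# Interval end points and middle boundary points, as described in the text.
HOMOGENEOUS = [0.46, 0.70, 0.79, 0.85, 0.91, 1.00]
HETEROGENEOUS = [0.30, 0.60, 0.70, 0.80, 0.90, 1.00]


def sample_true_scores(boundaries, size, rng):
    """Draw true mastery scores from a five-interval piecewise-uniform law.

    Each interval is chosen with equal probability (20% each for five
    intervals), and the score is then drawn uniformly within that interval.
    """
    scores = []
    for _ in range(size):
        k = rng.randrange(len(boundaries) - 1)
        scores.append(rng.uniform(boundaries[k], boundaries[k + 1]))
    return scores


rng = random.Random(0)  # fixed seed so the sketch is reproducible
homogeneous_sample = sample_true_scores(HOMOGENEOUS, 25, rng)
heterogeneous_sample = sample_true_scores(HETEROGENEOUS, 25, rng)
```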

Specification of Prior Information

An integral part of the process of utilizing Bayesian methods is the specification of one's prior belief about the distribution of true mastery scores. Unfortunately, the importance of the specification of a prior under varying testing conditions, for example, tests of different lengths administered to different numbers of examinees, is unknown. In our study, since the distribution of true




mastery scores in the population was known, it was possible to investigate the effects of specifying prior information on the various Bayesian estimates. Since the specification of prior information depends upon setting the value of $\phi_\gamma$, we studied the effects of setting four different values of $\phi_\gamma$ on the Bayesian joint modal and Bayesian marginal mean estimates. The four situations were: an accurate value of $\phi_\gamma$ based on the distribution of transformed true mastery scores; a large value for $\phi_\gamma$; a small value for $\phi_\gamma$; and a value for $\phi_\gamma$ derived from the data.

In addition, with the modified mean estimate, we generated four variations on the setting of prior information. We set $\phi_\gamma$ to be one of two values: an accurate value of $\phi_\gamma$ based on the distribution of transformed true mastery scores and a small value for $\phi_\gamma$. Also, $a$ was set to be one of two values: a value based on the examinee's simulated performance with true mastery score $\pi_i$ on one previous test or on five previous tests.


Cutting  Score 

In  this  study  the  cutting  score  was  set  to  be  .80.  This 
value  of  the  cutting  score  is  often  observed  in  practice. 

Testing the Fit of the Data

Two criteria were used to test the goodness of fit between the various estimates and the true mastery scores. These criteria are the loss functions based on (a) the average absolute difference and (b) decision accuracy. These two were selected because they seemed to be the most relevant within the context of criterion-referenced testing problems.

Average  Absolute  Difference  (AAD) 

This loss function is based on the average absolute difference between the estimates and the true mastery scores, and is given by the expression

$$\mathrm{AAD} = \sum_{i=1}^{N} |\hat{\pi}_i - \pi_i| / N.$$
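
As a small illustration, the average absolute difference can be computed as in the sketch below. The function name is hypothetical and the two-examinee input in the usage note is invented for the example.

```python
def average_absolute_difference(estimates, true_scores):
    """Mean of |estimate - true score| over all examinees."""
    if len(estimates) != len(true_scores):
        raise ValueError("estimates and true_scores must have equal length")
    total = sum(abs(e - t) for e, t in zip(estimates, true_scores))
    return total / len(true_scores)
```

For instance, estimates of .50 and .90 for true mastery scores of .60 and .80 give an average absolute difference of .10.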

Decision Accuracy

The second measure of goodness of fit used in our study was the proportion of correct and incorrect decisions arrived at using the various estimates of mastery scores.

When an examinee's true mastery score $\pi_i$ exceeds $\pi_o$, where $\pi_o$ is the point on the mastery score scale used to separate examinees into mastery and non-mastery states, the examinee is considered a true master. Likewise, when the true mastery score is below the cutoff score, the examinee is considered a true non-master. Since in practice these true mastery scores are not known, the allocation of examinees to mastery states is based upon observed scores, or the estimates $\hat{\pi}_i$ of the true mastery scores $\pi_i$. Again, if the estimated true mastery score $\hat{\pi}_i$ exceeds $\pi_o$, the examinee is classified as a master, and as a non-master otherwise.

In order to investigate the goodness of fit for the various estimates of the true mastery score and to study how these estimates affect the classification of examinees, we define a variable $Y_i$ such that

$$Y_i = 1 \quad \text{if } \pi_i \ge \pi_o,$$

and

$$Y_i = 0 \quad \text{if } \pi_i < \pi_o.$$

If we obtain an estimate $\hat{\pi}_i$ of $\pi_i$, then we can define the "estimate" of $Y_i$ as

$$\hat{Y}_i = 1 \quad \text{if } \hat{\pi}_i \ge \pi_o,$$

and

$$\hat{Y}_i = 0 \quad \text{if } \hat{\pi}_i < \pi_o.$$

The error of estimation can then be defined as $e_i = (\hat{Y}_i - Y_i)$. Obviously, the error $e_i$ can take on one of three values, $-1$, $0$, $+1$. When an examinee is classified correctly, $e_i = 0$. When a false positive error is committed, that is, when an examinee who is a true non-master is classified as a master, $e_i = 1$. Similarly, if a false negative error is committed, that is, when a true master is classified as a non-master, $e_i = -1$. We then define decision accuracy as

$$1 - \sum_{i=1}^{N} e_i^2 / N,$$

the ratio of the number of correct decisions to the total number of decisions made.
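
The decision-accuracy measure can be sketched as below. The function names are hypothetical, scores equal to the cutoff are treated as mastery (an assumption of this sketch), and the four-examinee input in the usage note is invented for illustration.

```python
def classify(score, cutoff):
    """Return 1 (master) if the score reaches the cutoff, else 0."""
    return 1 if score >= cutoff else 0


def decision_summary(estimates, true_scores, cutoff=0.80):
    """Decision accuracy plus counts of the two kinds of error.

    The error for each examinee is (estimated class - true class):
    +1 is a false positive, -1 is a false negative, 0 is correct.
    """
    errors = [
        classify(e, cutoff) - classify(t, cutoff)
        for e, t in zip(estimates, true_scores)
    ]
    n = len(errors)
    return {
        "accuracy": 1 - sum(err * err for err in errors) / n,
        "false_positives": errors.count(1),
        "false_negatives": errors.count(-1),
    }
```

With estimates of .90, .70, .85, and .60 for true scores of .85, .82, .75, and .50, the second examinee is a false negative and the third a false positive, so the decision accuracy is .50.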

In passing, it should be pointed out that the false positive and false negative errors can be weighted differentially, or in other words, various costs of misclassification can be introduced (5). In our study, losses were weighted equally, although it should be recognized that in many applications different losses will be used.

Simulating the Test Data

The first step in the simulation of test data was to specify the number of examinees, the test length, and the true mastery score distribution. The next step was to generate a true mastery score for each examinee. This was accomplished by selecting a number at random from the true mastery score distribution.

The third step was to generate a sample mastery or proportion-correct score for each examinee. The simulation of test data was accomplished using the binomial test model (8:508). Since the true mastery score $\pi$ for


Table 1. Goodness of Fit Results¹ Based on the Average Absolute Deviation Measure for 25 Simulations of Each of Three Sample Sizes in which Homogeneity of the Ability Distribution, Number of Test Items, and Prior Information Were Varied

                                         Ability Distribution
                               Heterogeneous               Homogeneous
                               Test Length                 Test Length
Estimate                       8        10       20        8        10       20

Proportion-Correct             .112     .098     .075      .103     .093     .067
Jackson                        .092     .085     .069      .080     .075     .056

Modified Marginal Mean   -1    .085     .077     .057      .082     .073     .050
                         -2    .078     .067     .054      .060     .053     .047
                         -3    .094     .083     .051      .097     .087     .050
                         -4    .056     .047     .037      .049     .043     .032

Bayesian Joint Modal     -1    .111     .100     .072      .093     .087     .062
                         -2    [illegible] .098  .066      .089     .085     .061
                         -3    .133     .127     .074      .103     .103     .087
                         -4    .092     .081     .063      .074     .069     .056

Bayesian Marginal Mean   -1    .093     .086     .070      .081     .076     .057
                         -2    .090     .084     .064      .076     .073     .057
                         -3    .099     .069     .066      .085     .081     .060
                         -4    .093     .081     .063      .076     .071     .056

¹ Smaller values indicate better estimates.


the  examinee  represented  the  probability^  correctly 
answering  any  ikm,  it  was  possible  to  convert  the  prob- 
ability into  item  lore's  (i.  e.,  I  for  a  correct  response 
and  0  for  an  ineortrct  response)  by  comparing  n  with 
random  numbers  seleeted'from  a  uniform  distribution  , 
on  the  interval  [0,/l].  If  the  random  number  was  leas 
than  or  equal  to  d  the  examinee  was  credited  with  a 
correct  responsej/otherwise,  the  exurninee  was  credited 
with  an  meorreel  response.  This  process  was  repeated  n 
time*  (fnr  n  items)  and  a  test  score  for  the  examinee  was 
obtained.  This  tent  score  was  then  converted  to  a  pro- 
portion-correct score,  and  the  procedure  was  repeated 
for  each  of  the  N  examinees. 
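The data-generation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the beta distribution used for the true mastery scores, its parameters, the seed, and the function names are all hypothetical choices, not details taken from the study.

```python
import random


def draw_true_scores(n_examinees, alpha, beta, rng):
    """Draw one true mastery score per examinee.

    The beta distribution is an illustrative assumption; the study only
    says that scores were sampled from a true mastery score distribution.
    """
    return [rng.betavariate(alpha, beta) for _ in range(n_examinees)]


def simulate_proportion_correct(true_scores, n_items, rng):
    """Simulate proportion-correct scores under the binomial test model.

    Each item is scored 1 when a uniform random number on [0, 1) is less
    than or equal to the examinee's true mastery score, and 0 otherwise.
    """
    scores = []
    for pi in true_scores:
        correct = sum(1 for _ in range(n_items) if rng.random() <= pi)
        scores.append(correct / n_items)
    return scores


rng = random.Random(1979)                        # fixed seed, hypothetical
true_scores = draw_true_scores(30, 8.0, 2.0, rng)
observed = simulate_proportion_correct(true_scores, 8, rng)
print(observed[:5])
```

Holding the seed fixed makes one replication reproducible; repeating the call with different seeds gives the independent replications that the study averages over.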

Once the proportion-correct scores for the sample
of N examinees on n items were obtained, "improved"
estimates were obtained by the methods described earlier.
Each set of estimates derived from the different
methods for the three test lengths, three sample sizes,
and from the two ability distributions was then compared
with the "true" values. Goodness of fit measures
described earlier were used to assess the appropriateness
of each set of estimates. To improve the stability of the
goodness of fit measures, 25 replications were conducted
for each set of test conditions. We reported the
mean goodness of fit measures across the 25 replications.
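The two goodness of fit measures are defined earlier in the paper and not in this passage, so the sketch below rests on assumed definitions: the average absolute deviation is taken as the mean of |estimate - true score|, and decision accuracy as the proportion of examinees whose estimated and true mastery scores fall on the same side of a cutting score. The function names and the example numbers are hypothetical.

```python
def average_absolute_deviation(estimates, true_scores):
    """Mean absolute difference between estimated and true mastery scores."""
    total = sum(abs(e - t) for e, t in zip(estimates, true_scores))
    return total / len(true_scores)


def decision_accuracy(estimates, true_scores, cutoff):
    """Proportion of examinees classified the same way by estimate and truth."""
    agree = sum((e >= cutoff) == (t >= cutoff)
                for e, t in zip(estimates, true_scores))
    return agree / len(true_scores)


def mean_over_replications(values):
    """Average a goodness of fit measure across replications."""
    return sum(values) / len(values)


estimates = [0.90, 0.70, 0.50]      # hypothetical estimated mastery scores
true_scores = [0.85, 0.80, 0.60]    # hypothetical true mastery scores
aad = average_absolute_deviation(estimates, true_scores)
acc = decision_accuracy(estimates, true_scores, cutoff=0.80)
print(aad, acc)
```

Smaller values of the first measure and larger values of the second indicate better estimates, which matches the footnotes to Tables 1 and 2.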




Results 

In Tables 1 and 2, we have reported the results of our
simulations. The entries in the tables require some
explanation. The modified marginal mean estimates 1, 2,
3, and 4 were obtained using Equation 9. These four
estimates differ with respect to the specification of $a$ and
$\phi_\gamma$. In estimates 1 and 3, $a$ was set equal to
$\sin^{-1}\sqrt{\hat{\pi}_i}$, where $\hat{\pi}_i$ was an estimate
of the ith examinee's true mastery score derived from
simulating his performance on one previous test occasion.
For estimates 2 and 4, the same procedure was followed
with one exception: $\hat{\pi}_i$ was the examinee's average
test performance on five previous tests. With estimates 1
and 2, $\phi_\gamma$ was set equal to the variance of $\gamma$,
obtained from the true distribution of $\gamma$. For estimates 3
and 4, the same procedure was followed with one exception:
$\phi_\gamma$ was set to be 1/4 of the variance of $\gamma$.

Since the effects of prior information on the Bayesian
joint modal and the marginal mean estimates were to be
investigated, we chose to consider four different priors.
The prior information for the two Bayesian estimation
procedures is summarized by specifying the two parameters,
$\nu$ and $\lambda$. Novick, Lewis, and Jackson (12) recommend
that it is appropriate in most situations to set
$\nu = 8$ and $\lambda = 6\phi_\gamma$. Hence, varying the prior
information amounted to setting different values for
$\phi_\gamma$. In our study, the Bayesian joint modal and the
Bayesian marginal mean estimates 1, 2, 3, and 4 were
obtained by setting $\phi_\gamma$ to four different values. For
the first estimate, $\phi_\gamma$ was set equal to the variance of
the transformed observed proportion scores in the sample
of examinees. With the second estimate, $\phi_\gamma$ was taken
to be the variance of $\gamma$ derived from Equation 2. In other
words, $\phi_\gamma$ was set equal to the variance of the
transformed true mastery scores. For the third estimate,
$\phi_\gamma$ was set to be 1/4 times as large as the variance of
the transformed true mastery scores. For the fourth
estimate, $\phi_\gamma$ was set to be 4 times the variance of the
transformed true mastery scores. These last two estimates
were introduced so that we could study the effects of a
prior based on a distribution that was either too
homogeneous or too heterogeneous relative to the
distribution of mastery scores in the sample of examinees.

Discussion

Sample Size

Since the results of our simulation study varied only
slightly as a function of sample size, we chose to simplify
the presentation of results by reporting average goodness
of fit measures across the three sample sizes for the
remaining six combinations of test lengths and true mastery
score distributions.

Test Length

Increasing the test length improved both the goodness
of fit measures; however, the improvements were modest:
an [illegible]% improvement in decision accuracy and a decrease
of about .027 in the average absolute difference index
when the test length was increased from 8 to 20 items.
This suggests that 8 items represents a sufficient basis
on which to assess student mastery or to make
instructional decisions from criterion-referenced test data.

Group  Homogeneity 

The group homogeneity had some interesting effects
on the goodness of fit measures. The goodness of fit
measure based on the average absolute deviations
indicated that the estimation procedures were more
effective with the homogeneous group than with the
heterogeneous group. This perhaps can be explained when
one realizes that the more homogeneous the distribution
of ability, the more valuable the group mean is for the
estimation of an individual's mastery score.

However, in terms of decision-making accuracy, the
situation was reversed. The decision-making accuracy was
better for a heterogeneous ability distribution than for a
homogeneous distribution. An explanation for this is as
follows: In a homogeneous distribution where the true
mastery scores are concentrated near the mean, and in
our study near the cutting score, more incorrect decisions
are bound to occur. This is because all the estimates, with
the exception of the proportion-correct score, weight
the observed score by the group mean, and even the
slightest change in either direction of the cutting score
would make the estimate unstable in terms of
decision-making accuracy. This occurs to a lesser extent in
heterogeneous distributions because of the spread of scores
that exists.

Table 2. Goodness of Fit Results(1) Based on a Decision Accuracy Measure for 25
Simulations of Each of Three Sample Sizes in which Homogeneity of the Ability
Distribution, Number of Test Items, and Prior Information Were Varied

                                        Ability Distribution
                                 Heterogeneous          Homogeneous
                                  Test Length           Test Length
Estimate                         8     10     20       8     10     20

Proportion-Correct             .627   .801   .858    .763   .776   .819
Jackson                        .607   .824   .860    .737   .754   .794

Modified Marginal Mean  -1     .631   .847   .864    .815   .822   .831
                        -2     .836   .865   .86?    .853   .840   .868
                        -3     .621   .837   .903    .794   .807   .839
                        -4     ----   .905   .922    .80?   .858   .906

Bayesian Joint Modal    -1     .730   .782   .853    .656   .696   .800
                        -2     .792   .782   .849    .700   .738   .802
                        -3     .654   .642   .843    .622   .625   .691
                        -4     .793   .829   .859    .776   .744   .808

Bayesian Marginal Mean  -1     .601   .821   .858    .725   .758   .802
                        -2     .803   .811   .849    .738   .761   .806
                        -3     .765   .806   .856    .744   .734   .788
                        -4     .793   .829   .859    .780   .737   .809

(1) Larger values indicate better estimates.
---- Entry not legible in the source copy; ? marks an illegible digit.


Prior Information

Specification of prior information was required only
for the Bayesian estimates. We considered the effects
of varying priors on all three Bayesian estimates; namely,
the modified marginal mean estimate, the joint modal
estimate, and the marginal mean estimate. Since setting
the prior for the modified marginal mean estimate
required a procedure that was different from that required
by the joint modal and the marginal mean estimates, we
shall consider them separately.

An examination of the entries in Table 1 indicates that
for the Bayesian joint modal estimate and the marginal
mean estimate, in general, estimate 4 produced the best
results, followed by estimates 2, 1, and 3. For estimate 4,
the prior information was based on a value of the variance
that was four times the true variance, while for estimate
3 the prior information was based on a value of the
variance that was .25 times the true variance. Since
estimate 1 was based on the sample variance, we shall not
be concerned with it for the present. Estimate 2 was
based on the variance of the true mastery score
distribution; in our case, it was taken to be .04.

The expression for the marginal mean estimate is given
by Equation 7. The quantity $\rho^*$ in Equation 7 increases
with $\lambda/\nu$, and when $\rho^*$ is large, little weight is given to
the group mean and vice versa. Thus, when $\phi_\gamma$ was set
equal to .16 (= 4 x .04), the value of $\rho^*$ is large and there
is little or no regression towards the mean of the group.
Hence, the estimate for $\gamma_i$ is close to the mean, or the true
score, of that examinee. Since our criterion is based on the
deviation of the estimated mastery score from the true
score, a smaller value for error would be obtained in this
case. A small value for $\lambda/\nu$, such as that obtained when,
for estimate 3, $\phi_\gamma$ was set equal to .01 (= .25 x .04), results
in considerable regression towards the mean of the group.
If an examinee's true score is far from the mean of the
group, a larger value for error would be obtained. This
fact is borne out by the fact that in the relatively
homogeneous distribution, the errors are relatively small. A
similar explanation is valid for the joint modal estimates.
In general, prior information had little effect on the
estimates as test length increased.
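The regression effect described above can be illustrated with a simple weighted average. Equation 7 is not reproduced in this passage, so the sketch does not implement the marginal mean estimate itself. It assumes a Kelley-type form in which the weight on the examinee's own transformed score is $\phi_\gamma / (\phi_\gamma + v)$, with $v = 1/(4n + 2)$ taken as the approximate sampling variance of an arcsine-transformed score on n items. All function names and example values are hypothetical.

```python
import math


def arcsine_transform(correct, n_items):
    """Arcsine transform of a score of `correct` out of `n_items`."""
    return math.asin(math.sqrt((correct + 0.375) / (n_items + 0.75)))


def own_score_weight(phi, n_items):
    """Assumed weight on the examinee's own score: phi / (phi + v)."""
    sampling_variance = 1.0 / (4 * n_items + 2)
    return phi / (phi + sampling_variance)


def regressed_estimate(g_observed, g_group_mean, weight):
    """Weighted average of the observed score and the group mean."""
    return weight * g_observed + (1.0 - weight) * g_group_mean


n_items = 8
g_observed = arcsine_transform(8, n_items)     # examinee with 8 of 8 correct
g_group = arcsine_transform(6, n_items)        # hypothetical group mean
w_wide = own_score_weight(0.16, n_items)       # prior variance 4 x .04
w_narrow = own_score_weight(0.01, n_items)     # prior variance .25 x .04
est_wide = regressed_estimate(g_observed, g_group, w_wide)
est_narrow = regressed_estimate(g_observed, g_group, w_narrow)
print(w_wide, w_narrow, est_wide, est_narrow)
```

Under these assumptions the wide prior leaves the estimate close to the examinee's own score, while the narrow prior pulls it most of the way to the group mean, which is the pattern the paragraph describes for estimates 4 and 3.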

The modified marginal mean 4, on the other hand,
produced best results when $\phi_\gamma$ was set equal to .25 of
the variance of the true distribution of $\gamma$ and when $a$
was set equal to the examinee's average score on five
previous tests. The reason for this is obvious when we
examine Equation 9. When $\phi_\gamma$ is small or the
reliability of the test is low, the estimate weights $a$ rather
heavily. In this case, an $a$ based on the examinee's five
previous test scores is an extremely good initial estimate of
the examinee's true score, and hence we obtained excellent
results with this estimate.

In summarizing, we note that in the present context,
the Bayesian joint modal and marginal mean estimates are
less affected by the prior information than the modified
marginal mean estimates unless a really bad prior is
specified. If exact and accurate values for $a$ and $\phi_\gamma$ can
be specified, then the modified marginal mean estimates
produce the best results. Since the modified marginal
mean estimates are sensitive to priors, care should be
exercised in using them.

Comparison  of  the  Estimates 

The results indicated that all the estimates fared far
better than the proportion-correct score in terms of the
average absolute deviation measure. The best estimates
were obviously the modified marginal mean estimates.
The other estimates, in order, are: (1) the marginal mean
and Jackson estimates; (2) the joint modal estimate based
on a good prior; (3) the proportion-correct score; and
(4) the joint modal estimate based on a bad prior.

In terms of decision-making accuracy, the results were
less clear cut. The modified mean estimates were again the
best. The proportion-correct score, the Jackson, and the
marginal mean estimates followed, in that order. The joint
modal estimates, though not too far behind, produced the
poorest results.

Possible explanations for the poor results obtained with
the joint modal estimate were given in the previous
sections. The explanation is that the joint modal estimates
are strictly intended for making joint decisions about all
the examinees. However, our criteria, the average absolute
deviation measure and the decision-accuracy measure, are
both based on the deviation of individuals from their true
mastery scores, and hence are biased against the joint
modal estimate.


Summary  and  Suggestions  for  Further  Research 

In summary, the results of our simulation study
comparing several methods for estimating student mastery
in a variety of testing situations were rather revealing.
Specifically, the classical Model II estimate and the
Bayesian estimates tended to produce better results than the
proportion-correct estimate, and we obtained better
results with the Bayesian estimates when the distribution
of true mastery scores was homogeneous. Also, we noted
that test length, and amount of prior information, had
only minor effects on the "quality" of the estimates. The
one exception occurred with the Bayesian modal estimate


JOURNAL  OF  EXPERIMENTAL  EDUCATION 



for which the prior, set on the true mastery score
distribution, was too homogeneous. In this situation the
results were quite poor.

In comparing the estimates we noted that the modified
marginal mean estimates tended to be "best," but
this result is somewhat misleading since in practice one
would seldom have as good a prior estimate of the
examinee's level of mastery and the distribution of true
mastery scores as we used in the study.

Although the results of our simulation study revealed
only modest improvements with Bayesian methods under
the conditions studied, we are not prepared to discourage
the use of Bayesian methods with criterion-referenced test
data. Quite the contrary, since with the availability of
a small computer, any improvement in the estimates,
however modest, is worth obtaining, especially as these
can be obtained with very little cost and effort.

It should be mentioned also that Bayesian methods can
be used to produce a posterior distribution on the unknown
true mastery score for an examinee which, when incorporated
with a loss structure, provides a basis for decision-making
(7, 14). Our results did indicate that, on the average,
better point estimates are obtained from the Bayesian
methods. Hence, Bayesian procedures do provide a better
basis for generating a probability distribution to represent
our belief about the location of the unknown true
mastery score for the individual.

Also, we believe that as users become more adept at
stating prior beliefs about examinees' level of mastery,
and if the Bayesian modal estimate is avoided in situations
when individual descriptions or decisions are required,
even better results than those reported in this
paper will be obtained. We should add that we would
expect the decision-accuracy associated with Bayesian
methods to improve in situations where there are more
than two mastery states (14).

In terms of further research, we think it highly desirable
to explore the possibility of applying the Bayesian
methods to tests with fewer than eight items. This would
make the methods applicable to many more testing situations
than is possible now. Also, it is with the shorter
tests that improvements on the proportion-correct estimates
are most needed.


NOTE 

1.  Without in any way implying their endorsement of the
research methodology used in the study or the results, the authors
would like to acknowledge their gratitude to James Algina, Paul
Jackson, and Melvin Novick for constructive criticisms and helpful
comments on an earlier draft of the manuscript. Ming-Mei Wang
kindly provided us with a computer program to compute the
marginal mean estimates.


REFERENCES 

1.  Ferguson, R. L. The development, implementation, and evaluation
of a computer-assisted branched test for a program of
individually prescribed instruction. Unpublished doctoral
dissertation, University of Pittsburgh, 1969.

2.  Glaser, R. Instructional technology and the measurement of
learning outcomes. American Psychologist, 1963, 18, 519-521.

3.  Glaser, R. Adapting the elementary school curriculum to
individual performance. In Proceedings of the 1967 Invitational
Conference on Testing Problems. Princeton, N. J.: Educational
Testing Service, 1968.

4.  Hambleton, R. K. Testing and decision-making procedures
for selected individualized instructional programs. Review of
Educational Research, 1974, 44, 371-400.

5.  Hambleton, R. K., & Novick, M. R. Toward an integration of
theory and method for criterion-referenced tests. Journal of
Educational Measurement, 1973, 10, 159-170.

6.  Jackson, P. H. Simple approximations in the estimation of
many parameters. British Journal of Mathematical and Statistical
Psychology, 1972, 25, 213-229.

7.  Lewis, C., Wang, M. M., & Novick, M. R. Marginal distributions
for the estimation of proportions in m groups. Psychometrika,
1975, 40, 63-75.

8.  Lord, F. M., & Novick, M. R. Statistical theories of mental
test scores. Reading, Mass.: Addison-Wesley, 1968.

9.  Novick, M. R. Bayesian considerations in educational information
systems. In Proceedings of the 1970 Invitational Conference
on Testing Problems. Princeton, N. J.: Educational
Testing Service, 1971.

10. Novick, M. R., Jackson, P. H., Thayer, D. T., & Cole, N. S.
Applications of Bayesian methods to the prediction of educational
performance. The British Journal of Mathematical and
Statistical Psychology, 1972, 25, 33-50.

11. Novick, M. R., & Lewis, C. Prescribing test length for
criterion-referenced measurement. In C. W. Harris, M. C. Alkin, & W. J.
Popham (Eds.), Problems in criterion-referenced measurement
(CSE Monograph Series in Evaluation, No. 3). Los Angeles:
Center for the Study of Evaluation, University of California,
1974.

12. Novick, M. R., Lewis, C., & Jackson, P. H. The estimation of
proportions in m groups. Psychometrika, 1973, 38, 19-46.

13. Spineti, J., & Hambleton, R. K. A computer simulation study
of tailored testing strategies for objectives-based instructional
programs. Educational and Psychological Measurement,
in press.

14. Swaminathan, H., Hambleton, R. K., & Algina, J. A Bayesian
decision-theoretic procedure for use with criterion-referenced
tests. Journal of Educational Measurement, 1975, 12, 87-98.

15. Wang, M. M. Tables of constants for the posterior marginal
estimates of proportions in m groups. ACT Technical Bulletin
No. 14. Iowa City, Ia.: The American College Testing Program,
1974.

16. Wang, M. D., & Stanley, J. C. Differential weighting: A review
of methods and empirical studies. Review of Educational
Research, 1970, 40, 663-705.



8.6    Reporting of Test Score Information

In this section, we will discuss some procedures for reporting
individual and group data from criterion-referenced tests. The examples
included in each sub-section are based upon the involvement of one of the
authors with criterion-referenced testing programs in several
school systems. The examples provide a look at a number of ways of
reporting individual and group test score information.

For the reader interested in a further discussion of how to report
test score information, the books by Gronlund (1974, 1976) provide an
excellent review of practical procedures.

8.6.1    Individual Test Score Information

First, we will provide some examples of how to report individual
test score information, and then we will discuss an alternative method of
presentation that does not involve the usually reported percentage correct
by objective.

The examples that follow present test results of three students
on three tests. The data presented are the percentage correct scores for
each objective on each of the three tests, where the tests have a varying
number of objectives. Also, data are presented on the average performance
across the objectives and the percent of the objectives mastered across
three test occasions. Two comments follow from a perusal of the data:
(1) This data output for an individual student provides an excellent
breakdown of performance, and is highly useful for decision-making on
an individual basis. (2) A reasonable way to present individual data is
by using a percentage correct score. In reference to comment two, we
will now discuss an alternate method for presenting individual test score
data.


[Sample computer printouts: individual reports for three students.
Green Valley, Title I; date June 1977. Each report lists the student's
background data, a summary of percentage scores (PS) by objective for
each test administered (for example, Word Attack Level 3, Word Attack
Level 4, and Dictionary), and, by test occasion, the average performance
on objectives and the percent of objectives mastered for the tests
P-RD, WA-1, WA-1M, WA-2, WA-3, WA-4, DICT, and P-CMP. The individual
entries are not legible in this copy.]


In sections 8.3.2 and 8.4.2, we discussed the use of Bayesian
procedures for estimating domain scores and for making decisions about
assignment to mastery states. If Bayesian procedures are used for the
above two purposes, Ferguson and Novick (1973) have discussed the practical
feasibility of presenting individual data in a different form than simply
percent correct. Rather than simply presenting percentage correct by
objective, Ferguson and Novick call for a change to new procedures such
that:

    Under the proposed changes, rather than evaluating
    student proficiency solely on the posttest results,
    additional data would be incorporated within the
    decision analysis process, and furthermore, the
    quantity reported would be an index relating the
    student's estimated proficiency to a stipulated
    standard. However, it should be emphasized that
    although the nature of the data reported in the
    student profile would change, the procedures employed
    by the teacher and/or student to judge proficiency
    would remain the same.

Thus, by employing Bayesian procedures, an alternate way of presenting
individual data by objective can be utilized. This involves the use of
a mastery index. Before discussing the index and its interpretation, some
data, similar to the data in Ferguson and Novick (1973), will be presented.

Objective    Percent Correct    Mastery Index

    1             87.5                80
    2             75                  76
    3            100                  92


In order to discuss the mastery index, a relevant cut-off point, using
one of the suggested methods in Unit 6, must be established. Assume that it
is .85. The mastery index for each objective then gives the probability
that the student's level of proficiency is above .85. For instance, on
objective 2, the student got 75% of the items correct, which is less than
the cut-off of .85. However, when the collateral and prior information is
combined with the percent correct information, we get a probabilistic
statement: although his/her percent correct score is below the cut-off, we
are still 76% certain his/her domain score is above .85. For this student,
it is apparent that his/her performance on the test is lower than the
performance on the collateral data being used.

In sum, Bayesian analysis provides a probabilistic statement about
mastery. The mastery index gives the test score interpreter a probability
of success figure, while the percent correct has no probabilistic
interpretation attached; it is either above or below the cut-off. How might
this probability statement be used? Suppose, for instance, you were
willing to move a student on to another objective if the odds were better
than two to one in favor of his/her actually being proficient. Then
the student would be advanced if his/her probability of mastery was
greater than .67. For the student in our example, he/she would be passed
to the next objective even though his/her percent correct score was less
than the cut-off. Rather than making a yes-no decision, the mastery index
allows you to ascertain the probability that you are making the correct
decision. Of course, the correctness of the probability statement will
depend on the quality of the collateral and prior information used in
obtaining "revised" domain score estimates.
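To make the idea of a mastery index concrete, the following sketch computes the posterior probability that a domain score exceeds the cut-off. It swaps in a much simpler model than the Bayesian procedures referred to in this unit: a beta prior combined with binomial test data, with the prior standing in for the collateral and prior information. The prior parameters, function names, and example scores are hypothetical, and the sketch is not expected to reproduce the index values in the table above.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x)
                    + (b - 1) * math.log(1 - x))


def prob_above(cutoff, a, b, steps=20000):
    """P(domain score > cutoff) under Beta(a, b), by the midpoint rule."""
    width = (1.0 - cutoff) / steps
    total = 0.0
    for k in range(steps):
        total += beta_pdf(cutoff + (k + 0.5) * width, a, b)
    return total * width


def mastery_index(correct, n_items, prior_a, prior_b, cutoff):
    """Posterior probability of mastery under the assumed beta-binomial model."""
    return prob_above(cutoff, prior_a + correct,
                      prior_b + n_items - correct)


def advance(index, odds_for=2.0):
    """Advance when the odds of mastery are better than odds_for to one."""
    return index > odds_for / (odds_for + 1.0)


# Hypothetical student: 6 of 8 items correct (75%), cut-off .85,
# and a prior (9, 1) that favours high domain scores.
index = mastery_index(6, 8, 9.0, 1.0, 0.85)
print(round(index, 2), advance(index))
```

The `advance` rule restates the two-to-one odds example in the text: odds better than two to one correspond to a probability greater than 2/3, or about .67.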

8.6.2    Group Test Score Information

In this section, two examples of methods for presenting group test
data will be discussed. Then a helpful table for making decisions about
group sizes will be presented and discussed.

The first example is actually a set of examples, based upon group
data for the school system discussed in the last sub-section.


The first set of tables gives a district summary of performance on
each reading objective for 6 tests. Average percent scores and percentage
of examinees who mastered are presented for each objective for each
test. The first table is collapsed across all 6 grades; subsequent
tables give district information, but summarized by grade (two examples
are given).
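The two summary figures named above, the average percent score and the percentage of examinees who mastered each objective, can be computed from individual results as in the sketch below. The mastery cut-off of 80 percent, the student labels, and the scores are hypothetical; the text does not state the cut-off used in the school system's reports.

```python
def summarize_objectives(scores_by_student, mastery_cutoff):
    """Return (objective, average percent score, percent who mastered).

    scores_by_student maps a student label to a list of percentage
    scores, one per objective, in the same order for every student.
    """
    n_objectives = len(next(iter(scores_by_student.values())))
    summary = []
    for j in range(n_objectives):
        column = [scores[j] for scores in scores_by_student.values()]
        average = sum(column) / len(column)
        mastered = sum(s >= mastery_cutoff for s in column)
        pct_mastered = 100.0 * mastered / len(column)
        summary.append((j + 1, round(average, 1), round(pct_mastered, 1)))
    return summary


# Hypothetical percentage scores on three objectives of one test.
scores = {
    "Student A": [100, 75, 50],
    "Student B": [80, 85, 60],
    "Student C": [90, 50, 100],
}
summary = summarize_objectives(scores, mastery_cutoff=80)
for objective, average, pct_mastered in summary:
    print(objective, average, pct_mastered)
```

The same function applies unchanged to any grouping, such as a district, a school, a grade, or a class, by passing in only the students who belong to that group.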


[Sample computer printouts: district summary of test results, June 1977.
System summary of performance on each reading objective, reported for each
test administered (including Word Attack Levels 1 to 3 and Dictionary).
For each test the printout gives the average percentage score on each
objective, the percentage of examinees who mastered each objective, the
average percentage per objective, and the number of examinees. The
individual entries are not legible in this copy.]


The following breakdown, rather than being district-wide by grade,
is by school, across grades within the school. Once again, percentage
correct and percentage who mastered each objective on each test are
presented. (School names are changed to preserve their anonymity.) The
same type of data can be reported for each grade within a school, and
also for each class within a school.


[Tables not legible in this copy. School Summary of Performance on Each
Reading Objective for Each Test Administered, Title I Students, June
1977. One pair of tables is given for each of two schools, Green Valley
and Humbleton. Rows are the tests (Prereading, Word Attack Levels 1
through 4, Dictionary); columns are objectives 1 through 15. The first
table of each pair is a Summary of Average Percentage Scores by
Objectives, with the average percentage per objective and the number of
examinees; the second gives the Percentage of Examinees Who Mastered
Objectives.]


The final set of tables is a pretest-posttest analysis of one of the
six reading tests. Note that there are two tables for each test: one
giving percent performance and the other percent mastery. The cells
give pretest results (October), posttest results (May), and a percentage
gain score. The data are presented for each grade in which the test
was administered. Rick DeFriesse, from the Laboratory of Psychometric
and Evaluative Research at the University of Massachusetts, Amherst,
developed the computer program to produce the tables.
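
As a rough illustration of how the gain column in such a table could be
derived, the sketch below assumes that the gain score is simply the
posttest percentage minus the pretest percentage. The function name and
the data values are hypothetical; this is not the program referred to
above.

```python
def gain_scores(pretest, posttest):
    """Return posttest minus pretest, in percentage points, per objective."""
    return {obj: posttest[obj] - pretest[obj] for obj in pretest}


# Hypothetical percent-performance values; not taken from the report's tables.
october = {"objective_1": 55, "objective_2": 65, "objective_3": 51}
may = {"objective_1": 83, "objective_2": 89, "objective_3": 86}

gains = gain_scores(october, may)
for objective, gain in gains.items():
    print(objective, october[objective], may[objective], gain)
```

The same function could be applied to percent-mastery figures, since
both tables share the October, May, and gain layout.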


[Tables not legible in this copy. Pretest-Posttest Analysis of the
Reading Skills, Green Valley, Word Attack Level 2. Rows are the
objectives; columns give October, May, and gain figures for each grade
tested and for all grades combined. The first table reports percent
performance and the second reports percent mastery.]

In sum, there are a wide variety of ways in which the data just
presented could be of use to decision makers. We have presented the
tables to the reader as an example of a viable method for reporting
group test score data.

The next example, taken from Hambleton, Gorth, and O'Reilly (1973),
demonstrates how a summary of group performance by objective across test
administrations is helpful for decision-making purposes. We present
the relevant figure first. The discussion that follows the figure is
taken directly from Hambleton, Gorth, and O'Reilly.


Figure 1. Achievement Profiles of a Group of Students on Four Objectives
Across Eight Test Administrations.

[Line graph not reproduced in this copy. Horizontal axis: Test
Administration. Vertical axis: scale running to 100. One profile is
plotted for each of Objectives 1 through 4.]


Figure 1 presents hypothetical levels of achievement for four
objectives across eight test occasions. In this example, Objective 1
was taught between the first and second test occasion, Objective 3
between the third and fourth test occasion, and Objective 4 between the
fourth and fifth. For the reason given below, Objective 2 was not taught.
On the pretest in the example, all objectives except number 2 show
achievement at the chance level, or at about 20% on the five-option,
multiple-choice items. From an analysis of the data after the second
test occasion, the following decisions might be made: (a) Objective 1
was not learned and should probably be retaught in a somewhat different
way; (b) since the performance level on Objective 2 was high on both the
first and second test occasion, one could safely skip instruction on it.
After the sixth test occasion, the following decision could be made on the
basis of the data: (a) the performance level on Objective 3 is slipping;
if it is an important objective, it should be reviewed. It is also noted
that the performance level on Objective 1 has not changed. One might
postulate that Objective 1 is just too difficult for this particular
group of students.
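
The kind of reasoning applied to the profiles above can be sketched as a
small rule. The chance level follows directly from the number of answer
options (one in five, or 20%). The cutoffs named `high` and `margin`,
and the wording of the suggested actions, are hypothetical choices made
for this illustration and do not come from Hambleton, Gorth, and
O'Reilly.

```python
def chance_level(n_options):
    """Percent correct expected from blind guessing on n-option items."""
    return 100.0 / n_options


def suggest_action(before, after, taught, n_options=5, high=80.0, margin=10.0):
    """Suggest a follow-up for one objective from two adjacent test occasions.

    `high` and `margin` are hypothetical thresholds, not values from the source.
    """
    near_chance = after <= chance_level(n_options) + margin
    if taught and near_chance:
        return "reteach in a different way"
    if not taught and before >= high and after >= high:
        return "skip instruction"
    if taught and after >= high:
        return "learned"
    return "keep monitoring"


# Hypothetical percent-correct values in the spirit of Figure 1.
objective_1 = suggest_action(before=20, after=22, taught=True)
objective_2 = suggest_action(before=85, after=88, taught=False)
print(objective_1)
print(objective_2)
```

A rule of this sort is only a screening aid; the decision to reteach or
to skip instruction would still rest with the teacher.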

Finally, the table that follows, taken from Millman (1972), may be
helpful to the reader when he/she has to decide about the number of
students needed in a testing situation. The interpretation of the
entries is explained in the footnote to the table.


Table 1

Maximum Percent of Time That a Given Error
Will Occur for Selected Test Group Sizes¹

Number of Students    Error That Can Be Tolerated in Estimating the True
Needed for Testing    Proportion of All Students Who Can Pass an Item

                       10%      15%      20%      25%      30%

        10             75a      34       34       11       11
        15             61       30       12        4        4
        20        [illegible]   26       12        4        1
        25             42       11        4        1       <1
        30             36       10        4        1       <1
        40             27        8        1       <1       <1
        50             20        3        1       <1       <1
        60             16        3       <1       <1       <1
        75             11        1       <1       <1       <1
       100              6       <1       <1       <1       <1
       150              2       <1       <1       <1       <1
       200              1       <1       <1       <1       <1
       250             <1       <1       <1       <1       <1

¹This table is reproduced (with permission and with minor changes)
from Millman (1972).

aThe number "75" has the following interpretation: When a
random sample of 10 examinees is used to estimate the proportion
of examinees in the population who can answer the item correctly,
the likelihood that the estimate will be off at least 10% is no
more than .75.
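
The interpretation given in the footnote can be checked with a short
calculation. The sketch below assumes that the worst case occurs when
the true proportion is 0.5 and uses the exact binomial distribution;
under that assumption it returns 75, 34, 34, 11, and 11 for a sample of
10 examinees. The function name is hypothetical, and this is not
claimed to be the procedure Millman used to build the table.

```python
from math import comb


def max_error_percent(n_students, error_pct):
    """Percent chance that a sample proportion misses a true proportion
    of 0.5 by at least `error_pct` percentage points (exact binomial)."""
    total = 2 ** n_students
    # |k/n - 1/2| >= error_pct/100 is equivalent, in whole numbers, to
    # |200k - 100n| >= 2 * error_pct * n, which avoids rounding problems.
    hits = sum(
        comb(n_students, k)
        for k in range(n_students + 1)
        if abs(200 * k - 100 * n_students) >= 2 * error_pct * n_students
    )
    return 100.0 * hits / total


first_row = [round(max_error_percent(10, e)) for e in (10, 15, 20, 25, 30)]
print(first_row)
```

Calling the function with other group sizes gives a way to judge how
many students are needed for a tolerable error, subject to the same
assumption about the true proportion.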



8.7 Grading

In this section of Unit 8, we will discuss two aspects
of using criterion-referenced test scores in the grading process. First,
we will discuss how one might best grade a student on the activities
he/she has undertaken in an objectives-based program. Then, we will
discuss the issue of how one assigns final grades in such a program.
However, before undertaking this discussion, we'd like to direct the
reader to two sources that do an excellent job of comparing and
contrasting norm-referenced and criterion-referenced grading procedures.
These are the 1970 article by Millman in Phi Delta Kappan and the 1974
book by Gronlund, Improving Marking and Reporting in Classroom
Instruction.

Since grading in an objectives-based program does not compare
students, but rather references the student's performances to the
objectives, a single checklist is the best form for grading. If the
instruction is group-based, a check mark next to the objectives that were
mastered is sufficient. However, if the instruction is individualized,
the date of mastery can be placed next to the objective to give a better
indication of progress. The following example, taken from Millman (1970),
illustrates the latter sort of checklist:


Report Card Based on a System of Criterion-Referenced Measurement

MATHEMATICS
Grade Two

Skill

Concepts
  Understands commutative property of addition (e.g., 4 + 3 = 3 + 4)
  Understands place value (e.g., 27 = 2 tens + 7 ones)

Addition
  Supplies missing addend under 10 (e.g., 3 + ? = 3)
  Adds three single-digit numbers
  Knows combinations 10 through 19
  *Adds two 2-digit numbers without carrying
  *Adds two 2-digit numbers with carrying

Subtraction
  Knows combinations through 9
  *Supplies missing subtrahend - under 10 (e.g., 6 - ? = [illegible])
  *Supplies missing minuend - under 10 (e.g., ? - 3 = 4)
  *Knows combinations 10 through 19
  *Subtracts two 2-digit numbers without borrowing

Measurement
  Reads and draws clocks (up to quarter hour)
  Understands dollar value of money (coins up to $1.00 total)

Geometry
  Understands symmetry
  Recognizes congruent plane figures - that is, figures which
    are identical except for orientation

Graph Reading
  *Knows how to construct simple graphs
  *Knows how to read simple graphs

[The handwritten dates of mastery entered beside the skills are
illegible in this copy.]


(Reproduced  with  permission,  from  Millman,  1970.) 
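
For readers who keep such records electronically, a checklist of this
kind can be represented very simply. The sketch below is one possible
layout, not a description of Millman's system: the class name, method
names, and the sample date are all hypothetical. Each objective is
paired with a date of mastery, or with nothing if it has not yet been
mastered.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MasteryChecklist:
    """Objectives for one student, each with an optional date of mastery."""
    objectives: list
    mastered_on: dict = field(default_factory=dict)

    def mark(self, objective, when):
        """Record the date on which an objective was mastered."""
        if objective not in self.objectives:
            raise KeyError(objective)
        self.mastered_on[objective] = when

    def report(self):
        """Return (objective, date or None) pairs in checklist order."""
        return [(obj, self.mastered_on.get(obj)) for obj in self.objectives]


# Hypothetical entries; the date is invented for the example.
card = MasteryChecklist(["Understands place value",
                         "Adds three single-digit numbers"])
card.mark("Understands place value", date(1970, 10, 5))
for objective, when in card.report():
    print(objective, when if when else "not yet mastered")
```

For group-based instruction the date could be replaced by a simple
check mark, as the text above describes.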



While criterion-referenced testing is highly appropriate for
monitoring student progress through the units of instruction making
up a course, the question often asked is, "How should final grades in
an objectives-based course be assigned?" The issue of course grading
has been hotly debated by administrators, teachers, and students. That
the issue is important is clear when it is recognized that grades affect
the career choices of many students and their attitude toward learning,
the amount of learning, and the amount of time spent in study.
Unfortunately, though, because of the confusion over the purposes of
grading and the inexperience of most instructors in areas of tests and
measurements, much of final grading is done rather badly. Within
objectives-based courses, the purpose of grading is clear and
unequivocal. The purpose of final grading is to indicate the overall
level of accomplishment of each student relative to the course
objectives.

How should a final examination be prepared? One highly acceptable
way has been discussed by Block (1971) within the context of mastery
learning programs.

The instructor determines the amount of time required for the
final examination (often one to two hours) and then proceeds to select
test items from the available pools of items measuring the course
objectives, preferably items that were not included in any of the unit
tests. The items are selected to be representative of the course
objectives. Depending on the number of course objectives and the time
available for testing, some course objectives may not be tested in the
final examination. The key concern is to develop a final examination
such that the test items can be considered to be a representative sample
of the material covered in the course.

Let us assume that your particular school insists that letter grades
be assigned to students to reflect their work in the course. This
constraint should pose no serious problem to the teacher. The test is
designed to provide test scores that can be used to infer a student's
level of mastery of the course content. The instructor's task is to
define the levels of performance that he/she feels reflect A, B, C, D,
and F grade level work. For example, the instructor may decide that the
appropriate minimum values for grades A through D are 90%, 80%, 70%,
and 60%, respectively. These values can be made known to the students
and even discussed with them.
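
A minimal sketch of how such performance standards translate into
letter grades follows. It uses the example values of 90%, 80%, 70%, and
60% from the text and assumes that each value is the minimum percent
score for its grade, with anything lower graded F. The function and
parameter names are hypothetical.

```python
def letter_grade(percent, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Map a percent score on the final examination to a letter grade.

    `cutoffs` lists (minimum percent, grade) pairs from highest to lowest;
    scores below the last minimum receive an F.
    """
    for minimum, grade in cutoffs:
        if percent >= minimum:
            return grade
    return "F"


for score in (95, 80, 72.5, 59):
    print(score, letter_grade(score))
```

An instructor who prefers different standards would change only the
`cutoffs` argument; the test itself need not change.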

Because of the way the test is constructed (sampling of items
to be representative of the course objectives), the setting of
performance standards can be done on a test score scale that has some
real meaning. Certainly, the usual test score scales have little
meaning, since one can seldom think of the items as a sample from any
well-defined domain, and therefore the only basis for test score
interpretation is to compare one score with another. Grades in
objectives-based courses are assigned to students on the basis of their
test performance relative to the performance standards that are set to
reflect different levels of mastery of course objectives.

The matter of combining unit test score results with final
examination results to produce a final grade will not be discussed here,
but the problem is a relatively simple one to resolve statistically.
Factors such as the relative importance of unit tests versus a final
examination would need to be considered in determining the most desirable
weighting factors for the two sources of test information.
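
Although the authors leave the details aside, one common way to combine
the two sources is a weighted average. The sketch below is only an
illustration under that assumption: the function name and the default
weight of 0.5 for the final examination are hypothetical, and the
choice of weight is exactly the judgment the paragraph above refers to.

```python
def final_percent(unit_scores, exam_score, exam_weight=0.5):
    """Combine the mean unit-test percent with the final-exam percent.

    `exam_weight` is the share of the final grade carried by the final
    examination; the remainder goes to the mean of the unit tests.
    """
    if not 0.0 <= exam_weight <= 1.0:
        raise ValueError("exam_weight must lie between 0 and 1")
    unit_mean = sum(unit_scores) / len(unit_scores)
    return (1.0 - exam_weight) * unit_mean + exam_weight * exam_score


combined = final_percent([80, 90], exam_score=70, exam_weight=0.5)
print(combined)
```

The combined percent can then be compared with the same performance
standards used for the final examination.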

In sum, the need to assign final grades in a course can be
accommodated within an objectives-based program that utilizes
criterion-referenced tests.




8.8  References 

Block, J. H. (Ed.) Mastery learning: Theory and practice. New York:
Holt, Rinehart, and Winston, 1971.

Ebel, R. L. Content standard test scores. Educational and Psychological
Measurement, 1962, 22, 11-17.

Ferguson, R. L., & Novick, M. R. Implementation of a Bayesian system
for decision analysis in a program of individually prescribed
instruction. ACT Research Report No. 60. Iowa City, Iowa:
American College Testing Program, 1973.

Fremer, J. Handbook for conducting task analysis and developing
criterion-referenced tests of language skills. PR 74-12. Princeton,
New Jersey: Educational Testing Service, 1974.

Gronlund, N. E. Improving marking and reporting in classroom instruction.
New York: Macmillan, 1974.

Gronlund, N. E. Measurement and evaluation in teaching. (3rd ed.)
New York: Macmillan, 1976.

Hambleton, R. K., & Gifford, J. A. Development and use of criterion-
referenced tests to evaluate program effectiveness. Laboratory
of Psychometric and Evaluative Research Report No. 52. Amherst,
MA: School of Education, University of Massachusetts, 1977.

Hambleton, R. K., Gorth, W. P., & O'Reilly, R. P. An application of an
evaluation model for classroom instruction. Journal of Educational
Technology Systems, 1973, 2, 117-131.

Hambleton, R. K., Hutten, L. R., & Swaminathan, H. A comparison of
several methods for assessing student mastery in objectives-
based instructional programs. Journal of Experimental Education,
1976, 45, 57-64.

Hambleton, R. K., & Novick, M. R. Toward an integration of theory and
method for criterion-referenced tests. Journal of Educational
Measurement, 1973, 10, 159-170.

Hambleton, R. K., Swaminathan, H., & Algina, J. Some contributions to
the theory and practice of criterion-referenced testing. In
D. N. M. de Gruijter, and L. J. Th. van der Kamp (Eds.), Advances
in psychological and educational measurement. New York: Wiley,
1976.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B.
Criterion-referenced testing and measurement: A review of technical
issues and developments. Review of Educational Research, 1978, 48,
1-47.




Huynh, H. Statistical consideration of mastery scores. Psychometrika,
1976, 41, 65-78.

Huynh, H. Reliability of multiple classifications. Psychometrika, 1978,
43, 317-325.

Jackson, P. H. Simple approximations in the estimation of many
parameters. British Journal of Mathematical and Statistical
Psychology, 1972, 25, 213-229.

Lewis, C., Wang, M. M., & Novick, M. R. Marginal distributions for the
estimation of proportions in m groups. ACT Technical Bulletin
No. 13. Iowa City, Iowa: The American College Testing Program,
1973.

Lewis, C., Wang, M. M., & Novick, M. R. Marginal distributions for the
estimation of proportions in m groups. Psychometrika, 1975, 40,
63-75.

Linden, W. J., & Mellenbergh, G. J. Optimal cutting scores using a
linear loss function. Applied Psychological Measurement, 1977,
1, 593-599.

Livingston, S. A. Criterion-referenced applications of classical test
theory. Journal of Educational Measurement, 1972, 9, 13-26.

Livingston, S. A. A utility-based approach to the evaluation of pass/
fail testing decision procedures. COPA Research Report. Princeton,
N.J.: Educational Testing Service, 1975.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Millman, J. Reporting student progress: A case for a criterion-
referenced marking system. Phi Delta Kappan, 1970, 52, 226-230.

Millman, J. Determining test length: Passing scores and test lengths
for objectives-based tests. Instructional Objectives Exchange,
Los Angeles, California, 1972.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley,
California: McCutchan Publishing Co., 1974.

Novick, M. R., & Jackson, P. H. Statistical methods for educational
and psychological research. New York: McGraw-Hill, 1974.

Novick, M. R., & Lewis, C. Prescribing test length for criterion-
referenced measurement. In C. W. Harris, M. C. Alkin, & W. J.
Popham (Eds.), Problems in criterion-referenced measurement.
CSE monograph series in evaluation, No. 3. Los Angeles: Center
for the Study of Evaluation, University of California, 1974.




Novick, M. R., Lewis, C., & Jackson, P. H. The estimation of proportions
in m groups. Psychometrika, 1973, 38, 19-45.

Popham, W. J. Educational evaluation. Englewood Cliffs, N.J.: Prentice-
Hall, 1975.

Swaminathan, H., Hambleton, R. K., & Algina, J. A Bayesian decision-
theoretic procedure for use with criterion-referenced tests.
Journal of Educational Measurement, 1975, 12, 87-98.

Wang, M. M. Tables of constants for the posterior marginal estimates
of proportions in m groups. ACT Technical Bulletin No. 14.
Iowa City, Iowa: The American College Testing Program, 1973.




Unit 9

Design of Criterion-Referenced Testing Programs¹

-Two Examples-


Prepared By


Ronald K. Hambleton
University of Massachusetts, Amherst

and

Daniel R. Eignor
Educational Testing Service


March 15, 1979


¹Substantial portions of the material in this unit were
drawn from Hambleton, R. K., Testing and decision-making procedures
for selected individualized instructional programs. Review of
Educational Research, 1974, 44, 371-400.




Table of Contents

9.0 Overview

9.1 Introduction

9.2 Individualized Instructional Programs

9.3 Instructional Models Under Consideration

9.4 Individually Prescribed Instruction (IPI)

9.4.1 Instructional Paradigm

9.4.2 Testing Model Description

9.4.3 Summary Comments

9.5 Mastery Learning

9.5.1 Instructional Paradigm

9.5.2 Testing Model Description

9.5.3 Summary

9.6 Summary

9.7 References Cited

9.8 References for Further Study




9.0 Overview

Previous units have concentrated on the development, validation,
and usage of criterion-referenced tests. In this unit, we will consider
two examples where criterion-referenced tests are used to serve a
variety of instructional purposes.




9.1  Introduction 

The primary purpose of the unit is to introduce readers to the
nature of individualized instructional programs and to two testing
programs that are in wide use: Individually Prescribed Instruction
(Glaser, 1968) and Mastery Learning (Block, 1971; Bloom, 1976).




9.2    Individualized  Instructional  Programs 

The idea of developing instructional programs in our schools to
meet individual student needs is not a new theme in American education
(Washburne, 1922), but it has been only since the early 1960's that
such programs have been implemented on any large-scale basis in the
schools.

The basic argument in favor of individualizing instruction comes
from a multitude of research and evaluation studies that suggest that
students differ in interests, motivation, learning rate, goals, and
capacity for learning, among other things; and, therefore, group-based
instruction on a common curriculum is inappropriate to meet their
educational needs. The necessity for change in our schools is evident
when it is noted, for example, that schools provide successful learning
experiences for only one-third of the students (Block, 1971).

The trend toward individualization of instruction in elementary
and secondary education and (to a lesser extent) in higher education
and technical education has resulted in the development of a diverse
collection of attractive alternative models (Gibbons, 1970; Gronlund,
1974) that, according to their supporters, offer new approaches to
student learning that can provide almost all students with rewarding
school experiences.

In the relatively short period of time that large-scale
individualized instructional programs have been under development, much
has been learned about the construction of instructional materials,
curriculum design, and computer management (Baker, 1971). However, until
recently, corresponding progress was not made in developing relevant
testing methods and decision procedures.



One reason for the shortage of testing information was that the
measurement requirements of many of the new instructional programs
called for new kinds of tests. These are criterion-referenced
tests, which are constructed and interpreted in ways quite different
from the norm-referenced tests with which most practitioners in the
field are familiar. Fortunately, much progress toward a theory and
practice of criterion-referenced testing has been made in recent years,
and many of these developments have been described by Hambleton et al.
(1978), Millman (1974), and Popham (1978).

Since one of the major purposes of individualized programs is to
maximize the opportunity for all students to learn, it follows that tests
used to monitor student progress should be keyed to the instruction
presented. Furthermore, they should provide information that can be used
to measure progress along an absolute achievement continuum.
Norm-referenced tests are constructed specifically to facilitate the
making of comparisons among students; hence, they are not very
well-suited for making most of the instructional decisions required in
individualized instructional programs.




Cronbach (1967) discussed three major patterns of dealing with
individual differences that provide a framework for the instructional
programs considered in this unit. Patterns of dealing with individual
differences in schools can be described in terms of the extent to which
educational goals and instructional methods are varied. In one pattern,
the educational goals and instructional methods are relatively fixed
and inflexible. Individual differences are handled mainly by dropping
students from a program when they begin to encounter difficulty. In a
second pattern, goals are selected for students on the basis of interest
and potential, and the students are channeled into one fixed program or
another. Individual differences are handled by providing multiple
optional programs. Programs described in this unit fit into a third
pattern where goals and instructional resources are individualized for
the purpose of maximizing learning and development. Although there are
hundreds upon hundreds of versions of instructional programs that would
fit into this third pattern of individualizing instruction, the two
programs we have selected incorporate most, if not all, of the forms
of testing that are likely to be found in an individualized instructional
program.

Our concern is with individualized instructional programs that
include a specification of the curriculum in terms of objectives,
detailed diagnosis of the entering competencies of students,
the availability of multiple instructional resources, individual pacing
and sequencing of material, as well as the careful monitoring of student
progress. Thus, our concern is with the most highly structured
individualized instructional programs that require substantially more
testing than other individualized programs, such as the open-classroom
plan.



9.4    Individually Prescribed Instruction (IPI)

The Learning Research and Development Center (LRDC) at the
University of Pittsburgh initiated the Individually Prescribed Instruction
Project during the early 1960's at the Oakleaf School, in cooperation
with the Baldwin-Whitehall Public School District near Pittsburgh. As
of 1974, the IPI program had been adopted by over 250 schools around
the country. We are not aware of any more recent count.

9.4.1    Instructional  Paradigm 

Although the instructional paradigm and the corresponding test
model are discussed in the context of the IPI mathematics program, the
procedures, techniques, etc., described are also applicable to the
other content areas covered in the program. In addition, it should be
noted that the mathematics program as implemented is probably somewhat
different from that described here, since the LRDC is constantly
refining and improving the program (Lindvall, personal communication).

Cooley and Glaser (1969) reported that the mathematics curriculum
consists of 430 specified instructional objectives. These objectives
are grouped into 88 units. (In the 1972 version of the program, there
were 359 objectives organized into 71 units.) Each unit is an instructional
entity, which the student works through at any one time. There
are 5 objectives per unit, on the average, the range being 1 to 14.
A collection of units covering different subject areas in mathematics
comprises a level; the level may be thought of as roughly comparable to
school grades. The number of objectives for each unit in the IPI mathematics
curriculum is presented in Table 9.4.1.
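As a quick arithmetic check on the figures just quoted, the short sketch below computes the average number of objectives per unit for both versions of the program. It uses only the counts given in the text; the function name is ours and purely illustrative.

```python
# Objective and unit counts quoted in the text for the two versions.
versions = {
    "Cooley & Glaser (1969)": (430, 88),
    "1972 version": (359, 71),
}


def mean_objectives_per_unit(objectives: int, units: int) -> float:
    """Average number of objectives contained in one unit."""
    return objectives / units


for label, (objectives, units) in versions.items():
    average = mean_objectives_per_unit(objectives, units)
    print(f"{label}: {average:.1f} objectives per unit")
```

Both averages round to about five, which agrees with the figure of 5 objectives per unit stated above.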



Table 9.4.1

Number of Objectives for Each Unit in the IPI Mathematics Curriculum*

Levels: A, B, C, D, E, F, G, H

Content Area                  Objectives per unit (units in ascending level order)

Numeration                    12, 10, 8, 8, 8, 3, 8, 4
Place Value                   3, 5, 10, 7, 5, 2, 1
Addition                      3, 10, 5, 8, 6, 2, 3, 2
Subtraction                   4, 6, 3, 1, 3, 1
Multiplication                8, 11, 10, 6, 3
Division                      7, 7, 9, 5, 5
Combination of Processes      6, 5, 7, 4, 5, 6
Fractions                     3, 2, 4, 6, 6, 14, 5, 2
Money                         4, 4, 6, 4, 1, 1
Time                          3, 2, 7, 9, 5, 3
Systems of Measurement        4, 3, 5, 7, 3, 2
Geometry                      2, 2, 3, 9, 10, 7, 9
Special Topics                1, 3, 3, 5, 4, 5

*Reproduced by permission from Lindvall, Cox, and Bolvin (1970).


A teacher is faced with the problem of locating, for students, that
point in the curriculum where they can most profitably begin instruction.
Also, a teacher is responsible for the continuous diagnosis of student
mastery as students proceed through their programs of study.

At the beginning of each school year, a teacher places a student
within the curriculum; that is, a teacher identifies the units in each
content area for which instruction is required. After completing the
gross placement, a single unit is selected as the starting point for
instruction, and a diagnostic instrument is administered to assess the
student's competencies on objectives within the unit. The outcome of
the unit test is information appropriate for prescribing instruction
on each objective in the unit. In addition, it is also necessary to
select the particular set of resources for a student. In theory, resources
that match the individual's "learning style" are selected. Within each
unit, there are short tests to monitor the student's progress. Finally,
upon completion of initial instruction in each unit, assessment and
diagnostic testing takes place. In the next section, the tests and the
mechanisms for making these decisions are reviewed.

9.4.2    Testing  Model  Description 

Various research reports over the last couple of years have dealt
with the testing model and its development (see, for example, Glaser &
Nitko, 1971). A flow chart of the testing model is presented in Figure 9.4.1.
To monitor a student through the program the following tests are used:
placement tests, unit pretests, unit posttests, and curriculum-embedded
tests. All of the tests are criterion-referenced, with performance on
the tests compared to performance standards for the purpose of
decision-making.


[Figure 9.4.1 is a flow chart with the following steps: placement test
taken; one specific unit selected for study; unit pretest taken (outcomes:
pass all skills, or fail one or more skills); prescription developed for
one skill in unit; student works on instructional materials for one skill;
CET for skill taken (outcomes: pass CET, fail CET, or pass CET for last
unmastered skill); unit posttest taken (outcomes: pass all skills, or fail
one or more skills).]

Figure 9.4.1.    Flow chart of steps in monitoring student
progress in the IPI program. (Reproduced,
by permission, from Lindvall and Cox, 1969.)
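The monitoring sequence in Figure 9.4.1 can be sketched as a small table of transitions. This is only an illustration: the step and outcome names are ours, and the arrows are our reading of the surrounding text (for example, that a failed CET sends the student back to instruction), not a reproduction of the original chart.

```python
# (current step, outcome) -> next step. All labels are hypothetical names
# chosen for this sketch; the routing follows the description in the text.
TRANSITIONS = {
    ("placement_test", "done"): "select_unit",
    ("select_unit", "done"): "unit_pretest",
    ("unit_pretest", "pass_all"): "select_unit",      # move on to next unit
    ("unit_pretest", "fail_some"): "prescription",
    ("prescription", "done"): "instruction",
    ("instruction", "done"): "cet",
    ("cet", "fail"): "instruction",                   # additional work
    ("cet", "pass"): "prescription",                  # next unmastered skill
    ("cet", "pass_last"): "unit_posttest",
    ("unit_posttest", "pass_all"): "select_unit",
    ("unit_posttest", "fail_some"): "prescription",   # repeat instruction
}


def walk(start, outcomes):
    """Return the list of steps visited, beginning at `start`."""
    path = [start]
    step = start
    for outcome in outcomes:
        step = TRANSITIONS[(step, outcome)]
        path.append(step)
    return path


# A student fails the pretest on one skill, fails the first CET,
# passes the second, and then passes the unit posttest.
print(walk("unit_pretest",
           ["fail_some", "done", "done", "fail", "done", "pass_last", "pass_all"]))
```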



Let  us  now  consider  in  detail  the  four  kinds  of  tests  and  the 
method  for  student  diagnosis. 

Placement Tests. When a new student enters the program, it is
necessary to place the student at the appropriate level of instruction
in each of the content areas. (Glaser & Nitko, 1971, called this stage-one
placement testing.) Typically, this is done by administering a
placement test that covers all of the subject areas at a particular level (see
Table 9.4.1). Factors affecting the selection of a level for placement
testing of a student include student age, past performance, and teacher
judgment. Generally, the placement test covers the most difficult or
most characteristic objectives within each area. Placement tests are
administered until a unit profile identifying a student's competencies
within each area is complete. At present, the somewhat arbitrary 80-85%
proficiency level is used for most tests in the IPI system.

Student  test  scores  on  items  measuring  objectives  in  each  unit  and 
area  in  the  placement  test  are  used  to  develop  a  program  of  study.  The 
standard  procedure  is  to  assign  a  student  to  instruction  on  units  in 
which  placement  test  performance  on  items  measuring  a  few  representative 
objectives  in  the  units  is  between  20%  and  80%.     If  the  score  is  less 
than  20%  for  a  given  unit,  the  unit  test  in  the  area  at  the  next  lowest 
level  is  administered  and  the  same  criterion  is  applied.    In  the  case 
where a student has a score of 80% or over, testing on the unit in the area
at the next highest level is initiated.
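The placement rule just described can be written as a single decision function. The sketch below is illustrative only; the function name and the returned strings are ours, while the 20% and 80% bounds are the ones stated in the text.

```python
LOWER_BOUND = 20  # percent; below this, test at the next lower level
UPPER_BOUND = 80  # percent; at or above this, test at the next higher level


def placement_action(percent_correct: float) -> str:
    """Apply the 20%/80% placement rule to one unit score."""
    if percent_correct < LOWER_BOUND:
        return "test next lower level"
    if percent_correct >= UPPER_BOUND:
        return "test next higher level"
    return "assign instruction at this level"


for score in (10, 20, 60, 80, 90):
    print(score, "->", placement_action(score))
```

Note that a score of exactly 20% leads to instruction at the tested level, and a score of exactly 80% leads to further testing, following the wording "less than 20%" and "80% or over."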

Next we will consider an example. In Table 9.4.2 are shown the test scores
of a typical student. The first tests administered to the student are those
measuring objectives in Level E. What instruction will be prescribed? What
additional testing should be done?



Table 9.4.2

A Set of Criterion-Referenced Test Scores
for a Typical Student

                                      Level Test
Content Area                   C       D       E       F

Numeration                                    60%
Place Value                                   90%     60%
Addition                                      60%
Subtraction                                   60%
Multiplication                                30%
Division                                      25%
Combination of Processes                       5%
Fractions                                     90%     10%
Money                                         50%
Time                          85%      0%     10%
Systems of Measurement                40%      0%
Geometry                                      30%
Special Topics                                30%



Example 

On the basis of the rules described above, it is likely
that the student:

1.  would be prescribed instruction at level E in the areas
of numeration, addition, subtraction, multiplication,
division, combination of processes, money, geometry,
and special topics, and

2.  would receive the level F placement tests in
place value and fractions.

3.  If the student scores 60% and 10% in place value and
fractions, respectively, the student would be assigned
to receive instruction at level F in place value and
probably level E in fractions.

4.  The student would also be administered the level D
placement tests in the areas of time and systems of
measurement.

5.  If  the  student's  scores  were  0%  and  40%  in  the  areas 
of  time  and  systems  of  measurement,  respectively,  the 
student  would  receive  a  still  lower  placement  test  in 
the  area  of  time  and  would  be  prescribed  instruction 
at  level  D  in  systems  of  measurement. 

6.  If  the  student  scores  85%  on  the  level  C  placement 
test  in  the  area  of  time,  the  student  would  be 
assigned  to  level  D  for  instruction. 
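The sequence of tests in this example can be traced mechanically. The sketch below is a hypothetical illustration, not the IPI procedure itself: it follows the 20%/80% rule level by level and stops either when a score falls between the bounds or when the student is bracketed between two adjacent levels. In the bracketed case the text resolves Fractions as "probably level E" and Time as level D, so the sketch leaves that final choice to teacher judgment rather than guessing a rule. The score dictionaries use values from Table 9.4.2.

```python
LEVELS = "ABCDEFGH"


def trace_placement(scores, start="E"):
    """Follow the 20%/80% rule for one content area.

    `scores` maps a level letter to the percent score obtained on that
    level's placement test. Returns (levels tested in order, outcome).
    """
    i = LEVELS.index(start)
    taken = []
    while True:
        level = LEVELS[i]
        taken.append(level)
        score = scores[level]
        if score < 20:
            step = -1          # try the next lower level
        elif score >= 80:
            step = 1           # try the next higher level
        else:
            return taken, f"instruct at level {level}"
        nxt = i + step
        if not 0 <= nxt < len(LEVELS):
            return taken, f"teacher judgment at end of range (level {level})"
        if LEVELS[nxt] in taken:
            low, high = sorted((level, LEVELS[nxt]))
            return taken, f"teacher judgment between levels {low} and {high}"
        i = nxt


print(trace_placement({"E": 90, "F": 60}))           # Place Value
print(trace_placement({"E": 90, "F": 10}))           # Fractions
print(trace_placement({"E": 10, "D": 0, "C": 85}))   # Time
print(trace_placement({"E": 0, "D": 40}))            # Systems of Measurement
```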


In  order  to  acquire  some  information  on  the  average  length  of  the 
tests,  the  level  E  placement  tests  of  the  1972  edition  of  the  IPI  program 
were  selected  and  examined.    Analysis  revealed  that,  on  the  average, 
there are 12 items measuring the objectives in each area (with a range of
six to 20).

In summary, the placement test has the following characteristics:
It provides a gross level of achievement for any student in the curriculum
and it provides information for proper placement of students in the
curriculum.

Unit Pretests and Posttests. Having received an initial prescription
of units, a student proceeds next to take a pretest for a unit at
the lowest level of mastery in his/her profile. (Glaser & Nitko, 1971,
call this stage-two placement testing.) A unit pretest includes one or
more items to measure each objective in the unit. A review of the unit
pretests and posttests in level E revealed that the approximate number of
items on a test is 37 (the range is from 21 to 64) and the average number
of items measuring each objective is six (the range is from four to seven).
Lindvall and Cox (1969) report that the length of a pretest is determined
by the number of objectives in the instructional unit and by the number
of items used to test each objective. No fixed number of items to measure
each objective is used because of the diverse nature of the objectives.
For example, they note that ". . . an objective like 'the pupil can solve
simple addition problems involving all number combinations' will require
more items than would an objective like 'the pupil must select which of
three triangles is equilateral'" (p. 175).

A student is prescribed instruction in each objective in the unit
for which he/she fails to achieve an 85% mastery level on the pretest.*
In  the  case  where  students  demonstrate  mastery  of  each  objective,  they 
are  moved  on  to  the  next  unit  in  their  profiles,  where  they  again  take  a 
pretest. 
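The pretest decision can be illustrated with a short sketch. The mastery score for an objective is the percentage of the items measuring that objective that the student answers correctly, and instruction is prescribed wherever that score falls below 85%. The objective labels and item responses below are hypothetical.

```python
from collections import defaultdict

MASTERY_CUTOFF = 85  # percent, as stated in the text


def mastery_scores(responses):
    """Percent correct per objective.

    `responses` is an iterable of (objective, is_correct) pairs,
    one pair per test item.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for objective, is_correct in responses:
        total[objective] += 1
        correct[objective] += int(is_correct)
    return {obj: 100 * correct[obj] / total[obj] for obj in total}


def objectives_to_prescribe(responses):
    """Objectives whose mastery score is below the cutoff."""
    scores = mastery_scores(responses)
    return [obj for obj, pct in scores.items() if pct < MASTERY_CUTOFF]


# Hypothetical pretest: three objectives with 6, 6, and 4 items.
pretest = (
    [("objective 1", True)] * 6
    + [("objective 2", True)] * 5 + [("objective 2", False)]
    + [("objective 3", True)] * 2 + [("objective 3", False)] * 2
)
print(mastery_scores(pretest))
print(objectives_to_prescribe(pretest))
```

In this hypothetical case the student misses only one of six items on the second objective, yet the resulting score of about 83% still falls below the 85% standard, so instruction is prescribed.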

The unit posttests are simply alternate forms of the unit pretests
and are administered to students as they complete instruction on the
unit. A student receives a mastery score for each objective in the unit.
He/she is required to repeat instruction on any objective where he/she fails
to achieve an 85% mastery score. The student is directed to the next unit
in his/her profile if he/she demonstrates mastery on each objective covered in
the unit posttest. The next unit prescribed is almost always one at the
lowest level of mastery (or grade level). Those who repeat instruction
on one or more of the objectives must take the unit posttest again before
moving on in their program.

*A mastery score on each objective for a student is calculated as
the percentage of items on the test measuring the objective that the
student answers correctly.

In summary, pretests and posttests are available for each unit of
instruction. The proper pretest is administered on the basis of a
student's curriculum profile, and learning tasks for each objective (or
skill, as it is called in the IPI program) within the unit are assigned
(or not assigned) on the basis of a student's performance on items
measuring the objective.

Curriculum-Embedded Tests. As the students proceed through a unit
of instruction, their progress is monitored. This is done by the use of
curriculum-embedded tests (CETs). As used in the mathematics IPI program,
a CET is primarily a measure of performance on one specific objective.
There are usually several test items to measure the objective. A review
of the CETs in level E of the program revealed that there are, on the
average, about three items measuring the primary objective covered in the
CET. The range is from two to five items. If a student receives a score
of 85% or over, the student is permitted to move on to the next prescribed
objective. Otherwise, the student is sent back for additional work before
taking an alternate form of the CET.

A second purpose of the CET is to assess, albeit in a fairly crude
way, whether or not the student has mastered the next objective in the
specified sequence for studying the objectives covered in the unit. If
the second objective included in the CET is not one the student has been
assigned to study, the student is moved on to be pretested on the second
half of a CET that covers the next objective in the student's program of
study. Regardless of which CET a student takes, if a score of 85% or
over is achieved on the items tested, instruction on the objective is not
required. This means that a student must score 100%, since
there are normally only about two items included in the test to cover the
second objective. This additional pretesting of an objective in the CET
gives students a chance to demonstrate mastery of new skills not
specifically covered in the instruction up to that point and to eliminate that
instruction from their programs.
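The remark that an 85% standard amounts to a perfect score on a two-item test follows from simple arithmetic. As an illustrative sketch (the function name is ours), the smallest number of correct answers that reaches a given percentage standard on a test of n items is the ceiling of n times the standard:

```python
def min_correct(n_items: int, cutoff_percent: int = 85) -> int:
    """Smallest number of correct items whose percentage meets the cutoff.

    Uses integer arithmetic for the ceiling of n_items * cutoff / 100,
    which avoids floating-point rounding.
    """
    return -(-n_items * cutoff_percent // 100)


# Test lengths mentioned in the text: 2 to 5 items on a CET,
# 4 to 7 items per objective on a unit pretest or posttest.
for n in (2, 3, 4, 5, 6, 7):
    print(f"{n} items: at least {min_correct(n)} correct")
```

With six or fewer items per objective, the 85% standard requires every item to be answered correctly; only with seven items can a student miss one item and still be classified as a master.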

Student  Diagnosis.    Once  the  student  has  been  assigned  to  a  unit  of 
instruction  and  the  objectives  for  which  instruction  is  needed  have  been 
identified  from  the  unit  pretest  data,  there  still  remains  the  problem  of 
deciding  which  of  several  instructional  methods  is  "optimal."    That  is, 
of  the  available  instructional  methods  for  a  particular  instructional  unit, 
in  which  of  them  would  a  student  with  a  known  background  in  the  program, 
and  specific  goals,  interests,  and  aptitudes,  stand  the  "best"  chance  of 
learning  the  material?    Glaser  and  Nitko  (1971)  call  this  a  diagnostic 
decision. 

9.4.3    Summary Comments

The  Individually  Prescribed  Instruction  program  is  a  highly  structured 
system of individualizing instruction that has become a model for literally
hundreds of other developers of individualized programs.




9.5    Mastery  Learning 

The mastery learning concept was introduced to American schools
in the 1920's with the work of Washburne (1922) and others in the format
of the Winnetka Plan. The program flourished in the 1920's; however,
without the technology to sustain a successful program, interest among
developers and implementers steadily diminished (Block, 1971). According
to Block (1971), mastery learning was revived in the form of programmed
instruction in the late 1950's in an attempt to provide students with
instructional materials that would allow them to move at their own pace
and receive constant feedback on their level of mastery. But programmed
instruction was not effective for all students, and so, in an attempt to
handle individual differences better, Bloom (1968) and his students
(Airasian, 1971; Block, 1971) improved on the standard programmed
instruction model by combining it with a model of school learning developed
by Carroll (1963, 1970). Carroll's model of school learning provided the
conceptual framework for more effective handling of individual differences
within an objective-based curriculum. In brief, Carroll's model states
that the level of mastery reached by a student on any instructional task
or school objective is a function of the time actually spent learning the
material and the amount of time the student needs to master the material.
The amount of time a student actually spends learning the material depends
on two factors: time allowed and perseverance. The amount of time needed
by the student is dependent on three factors: aptitude, quality of the
instructional materials, and the student's ability to understand the
instructional materials. Carroll goes on to explain how these five factors
interact to affect student success in school learning.
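One simple way to make the relationship between time spent and time needed concrete is sketched below. The text above says only that mastery is "a function of" these two quantities, so the specific form used here is an assumption made for illustration: the degree of learning is taken as the ratio of time spent to time needed, capped at 1, and time spent is taken as the smaller of time allowed and perseverance (both expressed in the same time units). The amount of time needed is supplied directly, since the text does not say how aptitude, quality of materials, and ability to understand them combine.

```python
def time_spent(time_allowed: float, perseverance: float) -> float:
    """Assumption: the student works for whichever of the two is shorter."""
    return min(time_allowed, perseverance)


def degree_of_learning(time_allowed: float,
                       perseverance: float,
                       time_needed: float) -> float:
    """Illustrative ratio form: time spent / time needed, capped at 1.0."""
    if time_needed <= 0:
        raise ValueError("time_needed must be positive")
    return min(1.0, time_spent(time_allowed, perseverance) / time_needed)


# Hypothetical student who needs 8 hours to master a unit.
print(degree_of_learning(time_allowed=10, perseverance=6, time_needed=8))
print(degree_of_learning(time_allowed=10, perseverance=12, time_needed=8))
```

Under these assumptions, the first student stops after 6 of the 8 hours needed and reaches only partial mastery, while the second is given and uses enough time to reach full mastery.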



Since Bloom's original paper in 1968 describing mastery learning,
a considerable amount of mastery learning research has been conducted,
and the results suggest that the mastery learning model can be easily
and inexpensively implemented in courses at any level of education and
in a wide range of content areas (Block, 1970). In particular, Block
(1971) notes that the best results have been obtained when the course
requires either minimal prior learning or previous learning that all
or almost all of the students possess. In addition, various research
findings have shown better results in courses when the content is highly
structured and sequential in nature. The mastery learning model has
been used successfully now with more than 100,000 students in elementary,
secondary, and college-level courses. The 100,000 figure is a conservative
one. Mastery learning programs are being introduced all over the
world, and it is no longer possible to keep up with the scope and size
of each.

The  outstanding  features  of  mastery  learning  appear  to  be  that  it 
is  easily  implementable,  does  not  require  the  use  of  a  computer  to  manage 
instruction,  and  is  appropriate  for  almost  any  content  area.    Also,  if 
mastery  learning  is  carried  out  properly,  previous  research  suggests  that 
students  will  achieve  higher  scores  and  have  more  interest  and  a  better 
attitude  toward  school. 

9.5.1    Instructional  Paradigm 

The curriculum is organized into units of instruction defined by
homogeneous clusters of objectives. Initial instruction on the
objectives covered in the unit is group-based. In this respect, mastery
learning is structurally different from IPI. For each unit, one or more
criterion-referenced tests, called formative tests, are used to assess
mastery of the objectives. These tests are administered immediately
following the completion of the group-based instruction. Individualization
is handled via supplemental materials, feedback, and corrective
techniques, applied to students who fail to achieve the defined level of
mastery on the test items covering the unit objectives. Following the
last unit of instruction in the course, a final test covering a
representative sample of course objectives is administered, and the data used
for grading purposes.

In  describing  the  mastery  learning  model,  Mayo  (1970)  notes  that: 

1.  Students  are  made  aware  of  course  and  unit  expectations,  so 
that  they  view  learning  as  a  cooperative  rather  than  as  a 
competitive  venture. 

2.  Standards  of  mastery  are  set  in  advance  for  the  students, 
and  grading  is  in  terms  of  absolute  performance  rather  than 
relative  performance. 

3.  Short diagnostic tests are used at the end of each instructional
unit.

4.  Additional learning is prescribed for those who do not
demonstrate unit mastery.

5.  Additional time for learning is prescribed to students who
seem to need it.

In summary, there are many variations on the basic mastery model,
as originally proposed by Bloom (1968). For example, different
implementers tend to vary in the extent to which feedback/correction
procedures are available and used (Block, 1971). In the next section, the
decision points in the program will be considered.


9.5.2    Testing  Model  Description 

Block (1971) notes that "To individualize instruction within the
context of ordinary group-based instruction, mastery learning relies
heavily on the constant flow of feedback information to teacher and
learner" (p. 9). However, it would seem that there is substantially less
testing in a mastery learning program than in IPI. A flow chart of the
testing model is shown in Figure 9.5.1.

As compared to IPI, there is no placement testing, and unit
pretesting and curriculum-embedded testing are not emphasized. Unit
posttesting and final assessment represent the two major kinds of testing
in the program. Tests to achieve these two purposes are called "formative"
and "summative" tests, respectively. Formative tests, or unit posttests
as they are called in IPI, are not used for grading. The student data
derived from a formative test is used exclusively for diagnosing learning
difficulties.

Formative Tests. A formative test, alternatively called a diagnostic-
progress test, is a criterion-referenced test that is designed to cover
the objectives over a unit of instruction in the mastery learning program.
It is used to determine whether or not a student has mastered the material
and to serve as a basis for prescribing supplemental work in areas where
the student is weak (Airasian, 1971). It is expected also that the test
will reinforce the learning of high-achieving students. Implementers of
the mastery learning model have set the passing standard anywhere from
75% to 100%. There is no set number of items or format suggested to
measure each objective; in addition, there is a suggestion that
instructional decisions are made on the basis of responses to individual items.




[Figure 9.5.1 is a flow chart with the following steps: unit in program
selected for study; group-based instruction on the objectives in the unit;
unit posttest taken (formative test), with outcomes pass all skills
(followed by free study time, tutoring others, etc.) or fail one or more
skills (followed by prescription developed: use of alternative resources);
last unit of instruction? (no or yes); final assessment: summative test.]

Figure 9.5.1.    Flow chart of steps in monitoring student progress
in a typical version of a mastery learning program.
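The sequence summarized in Figure 9.5.1 can be sketched as a simple loop over the units of a course. This is a hypothetical illustration: the step wording is ours, and whether a student retakes the formative test after using the correctives is not specified here, so the sketch does not model a retest.

```python
def mastery_course_steps(formative_passed):
    """List the steps a student goes through in a mastery learning course.

    `formative_passed` holds one boolean per unit, in course order,
    saying whether the student passed all skills on the formative test.
    """
    steps = []
    for unit, passed in enumerate(formative_passed, start=1):
        steps.append(f"unit {unit}: group-based instruction")
        steps.append(f"unit {unit}: formative test")
        if passed:
            steps.append(f"unit {unit}: free study time or tutoring others")
        else:
            steps.append(f"unit {unit}: correctives using alternative resources")
    steps.append("summative test")  # follows the last unit of instruction
    return steps


for step in mastery_course_steps([True, False]):
    print(step)
```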




The  formative  tests  in  mastery  learning  represent  the  key  to 
individualizing  instruction  since  it  is  on  the  basis  of  the  scores  on 
these  tests  that  individualization  of  instruction  can  take  place.  Units 
are  kept  small  so  that  unit  testing  takes  place  frequently  in  order  to 
increase the effectiveness of the individualization of instruction
component of the program.

Although  it  remains  an  unresolved  problem,  the  matter  of  setting 
mastery  levels  or  cutting  scores,  by  which  students  can  be  separated 
into  mastery  and  non-mastery  states  on  the  basis  of  their  performance 
on  test  items  designed  to  measure  objectives  included  in  the  criterion- 
referenced  tests,  has  been  more  actively  researched  in  the  context  of 
the  mastery  learning  program  than  anywhere  else.    In  addition  to  the 
usual  concern  for  setting  mastery  levels  high  enough  to  guarantee  that 
students  will  have  the  necessary  preparation  to  begin  the  next  segment 
of  instruction,  Block  (1970)  has  noted  that,  in  mastery  learning,  the 
mastery  level  is  set  in  a    way  that  will  maximize  interest  in  and  attitude 
toward  learning.    Some  interesting  controlled  research  studies  have 
revealed  that  a  mastery  level  of  about  80-85%  is  substantially  better 
than  a  level  that  is  higher  or  lower.     Block's  results  suggest  that  setting 
mastery  levels  high  (95%)  may  be  best  for  cognitive  learning  but,  in  the 
long  run,  positive  attitudes  and  interest  in  the  subject  are  less  likely 
to develop. With a reduction in the mastery level to 85%, there was a
reduction in cognitive learning, but selected affective outcomes were
maximized. If the mastery level is set lower than 80-85%, students do
not  usually  have  sufficient  mastery  of  the  skills  to  proceed  effectively 
with  the  instruction. 



Summative Tests. The primary purpose of the summative test in
the mastery learning model is to grade students on the basis of their
achievement of course objectives. The items in the test are keyed to
objectives and are selected to be representative of the total pool of
course objectives. A criterion-referenced interpretation of the scores
is recommended. Bloom (1971) proposed that cutting points be located
on the ability continuum and that grades should be assigned on the
basis of a student's position on the continuum and not relative to other
students in the course. A norm-referenced interpretation of the scores
is also possible.
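The difference between the two interpretations can be illustrated with a short sketch. The cutting points, grade labels, and class scores below are hypothetical and chosen only for illustration; they are not values proposed by Bloom.

```python
from bisect import bisect_right

CUT_POINTS = [60, 70, 80, 90]          # hypothetical cutting points (percent)
GRADES = ["F", "D", "C", "B", "A"]     # one more grade than cutting points


def criterion_grade(percent: float) -> str:
    """Grade from the student's own position relative to fixed cut points."""
    return GRADES[bisect_right(CUT_POINTS, percent)]


def percentile_rank(percent: float, class_scores) -> float:
    """Norm-referenced view: percent of the class scoring below this score."""
    below = sum(score < percent for score in class_scores)
    return 100 * below / len(class_scores)


strong_class = [85, 90, 95, 99]
weak_class = [50, 60, 70, 85]

# The same score of 85% earns the same grade in either class ...
print(criterion_grade(85))
# ... but its standing relative to classmates differs sharply.
print(percentile_rank(85, strong_class), percentile_rank(85, weak_class))
```

Under the criterion-referenced interpretation, the grade depends only on the student's own score; under the norm-referenced interpretation, the same score is the lowest in one class and the highest in the other.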

Assignment to Instructional Modes. A key part of the mastery
learning program is the availability of an extensive number of instructional
methods for use by students who fail to demonstrate mastery of
the objectives covered on the formative test. A formative test is
administered at the end of the group-based instruction on the unit objectives.

Among the alternative resources that are typically available to
the student are: small-group problem sessions, individual tutoring, and
alternative learning materials, such as alternate textbooks, workbooks,
programmed instruction, audiovisual methods, academic games and puzzles,
and reteaching.

The developers of the program have left the decision on the
appropriate instructional correctives to the student. It is expected that,
through experimentation with many of the instructional correctives, the
student will eventually learn which is "best." This would seem to be a
very realistic solution to the problem because of the shortage of
available data on the appropriate matches between student characteristics and
instructional correctives.



9.5.3  Summary 

Mastery learning is less different from conventional instruction
than IPI since initial instruction on objectives in a mastery learning
program is group-based and final grades are assigned. On this latter
point, however, it should be noted that because of the organization of
the curriculum and the approach to test development and test score
interpretation, it is unlikely that the final assessment is as
threatening a situation to the student as it usually is in more conventional
programs. As compared to conventional instructional programs, mastery
learning programs include features such as individual pacing, the
frequent use of criterion-referenced tests on small units of instruction
to diagnose learning problems, and feedback/corrective techniques.




9.6    Summary

The  successful  implementation  of  an  individualized  instructional 
program  depends,  in  part,  upon  the  availability  of  appropriate  testing 
and  decision-making  procedures  to  monitor  student  progress.    In  this 
unit we have described and compared the testing models of two of the
best-known and most widely adopted instructional programs: IPI and Mastery
Learning.



9.7    References  Cited 


Airasian, P. W.  The role of evaluation in mastery learning.  In J. H.
Block (Ed.), Mastery learning: Theory and practice.  New York:
Holt, Rinehart and Winston, 1971.

Baker, F. B.  Computer-based instructional management systems: A first
look.  Review of Educational Research, 1971, 41, 51-70.

Block, J. H.  The effects of various levels of performance on selected
cognitive, affective, and time variables.  Unpublished doctoral
dissertation, University of Chicago, 1970.

Block, J. H. (Ed.)  Mastery learning: Theory and practice.  New York:
Holt, Rinehart and Winston, 1971.

Bloom, B. S.  Learning for mastery.  Evaluation Comment, 1968, 1(2).

Bloom, B. S.  Mastery learning.  In J. H. Block (Ed.), Mastery learning:
Theory and practice.  New York: Holt, Rinehart and Winston, 1971.

Bloom, B. S.  Human characteristics and instruction: A theory of school
learning.  New York: McGraw-Hill, 1976.

Carroll, J. B.  A model of school learning.  Teachers College Record,
1963, 64, 723-733.

Carroll, J. B.  Problems of measurement related to the concept of
learning for mastery.  Educational Horizons, 1970, 48, 71-80.

Cooley, W. W., & Glaser, R.  The computer and individualized instruction.
Science, 1969, 166, 574-582.

Cronbach, L. J.  How can instruction be adapted to individual differences?
In R. M. Gagné (Ed.), Learning and individual differences.  Columbus,
Ohio: Charles E. Merrill, 1967.

Gibbons, M.  What is individualized instruction?  Interchange, 1970, 1,
28-52.

Glaser, R.  Adapting the elementary school curriculum to individual
performance.  In Proceedings of the 1967 Invitational Conference
on Testing Problems.  Princeton, N.J.: Educational Testing Service,
1968.

Glaser, R., & Nitko, A. J.  Measurement in learning and instruction.  In
R. L. Thorndike (Ed.), Educational measurement.  (2nd ed.)  Washington:
American Council on Education, 1971.

Gronlund, N. E.  Individualizing classroom instruction.  New York:
Macmillan Publishing Co., 1974.


53V 


-25- 


Harableton,  "R.  K.,  Swarainathan,  H. ,  Algina,  J.,  &  Coulson,  D. •  Criterion- 
referenced  testing  and  measurement:    A  review  of  technical  issues 
and  developments.    Review  of  Educational  Research,  1978,  _48,  1-47. 

Lindvall, C. M., & Cox, R. The role of evaluation in programs for
individualized instruction. In R. W. Tyler (Ed.), Educational
evaluation: New roles, new means. Sixty-eighth Yearbook, Part II.
Chicago: National Society for the Study of Education, 1969.

Lindvall, C. M., Cox, R. C., & Bolvin, J. O. Evaluation as a tool in
curriculum development: The IPI evaluation program. AERA Monograph
Series on Curriculum Evaluation, No. 5. Chicago: Rand
McNally, 1970.

Mayo,  S.  T.    Mastery  learning  and  mastery  testing.    NCME  Measurement  in 
Education, 1970, 1, 3.

Millman,  J.    Criterion-referenced  measurement.    In  W.  J.  Popham,  (Ed.), 
Evaluation  in  education:    Current  practices.    Berkeley,  Calif.: 
McCutchan  Publishers,  1974. 

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, N.J.:
Prentice-Hall,  1978. 

Washburne, C. W. Educational measurements as a key to individualizing
instruction  and  promotions.    Journal  of  Educational  Research, 
1922,  5,  195-206. 




9.8     References  for  Further  Study 

Block,  J.  H.  (Ed.)    Schools,  society  and  mastery  learning.    New  York: 
Holt,  Rinehart,  &  Winston,  1974. 

Block,  J.  H. ,  &  Anderson,  L.  W.    Mastery  learning  in  classroom  instruction. 
New  York:    Macmillan,  1975. 

Bloom,  B.  S.    Human  characteristics  and  instruction:    A  theory  of  school 
learning. New York: McGraw-Hill, 1976.

Davies,  I.  K.    Competency  based  learning:    Technology,  management,  and 
design.    New  York:    McGraw-Hill,  1973.     (A  practical  book  for 
teachers  providing  an  introduction  to  the  field  of  instructional 
systems.)

Glaser,  R.    Adaptive  instruction:    Individual  diversity  and  learning. 
New  York:    Holt,  Rinehart,  and  Winston,  1976. 


Klausmeier, H. J., Rossmiller, R. A., & Saily, M. Individually guided
elementary education: Concepts and practices. New York: Academic
Press, 1977. (The book provides excellent coverage of the
theory and practice of Individually Guided Instruction, which was
developed by the University of Wisconsin Research and Development
Center for Cognitive Learning.)

Torshen, K. P. The mastery approach to competency-based education.
New York: Academic Press, 1977. (The book provides readers with
a good up-to-date review of the theory and research related to
competency-based instruction.)




Unit  10 

New Developments and Areas for Further Research¹


Prepared By

Ronald K. Hambleton
University of Massachusetts, Amherst

and

Daniel R. Eignor
Educational Testing Service


March 15, 1979


¹Substantial portions of material in the unit are from Hambleton,
R. K., Swaminathan, H., Algina, J., and Coulson, D. Criterion-referenced
testing and measurement: A review of technical issues and developments.
Review of Educational Research, 1978, 48, 1-47.




Table of Contents

10.0  Overview of the Unit

10.1  Important Developments and Areas for Further
Research and Development

10.2  References


10.0  Overview  of  the  Unit 

The purpose of this unit is to introduce practitioners to several
important new developments and to several important criterion-referenced
testing topics that have not been satisfactorily resolved.




10.1  Important  Developments  and  Areas  for 
Further  Research  and  Development 

One of the most pressing problems for measurement specialists in
the 1970s has been the need to produce criterion-referenced test
technology and instruments quickly. Unfortunately, the desire of many
individuals, organizations, and agencies to use criterion-referenced
tests has far exceeded the testing profession's ability to produce test
development standards and high quality instruments to meet this need.
As a consequence, classroom teachers have been using "home-made" or
commercially prepared criterion-referenced tests (which, in most
instances, should be called "objectives-referenced tests") of undetermined
quality to make instructional decisions; program evaluators (recognizing
shortcomings of norm-referenced tests in program evaluation activities)
have been constructing criterion-referenced tests based on the "best"
principles they can find in a body of literature that is confusing,
contradictory, and massive in size (with more unpublished than published
papers being circulated); and professional licensing organizations have
been grappling with issues such as test score validity and determination
of cut-off scores, in the midst of complicated legal actions by the
courts. All of the above, as well as many other factors, have contributed
to a highly unsettled and volatile situation.

It appears now that there is sufficient theory and practical
guidelines for implementing at least adequate criterion-referenced testing
programs in situations as far-ranging as objectives-based instructional
programs at the classroom level, program evaluations at the district and
statewide level, and competency-based certification programs at the state
and national level.



What important criterion-referenced testing developments and areas
for research have emerged? There appear to be several. One, behavioral
objectives are being replaced by "amplified objectives" (Millman, 1974)
or domain specifications (Popham, 1978). This shift is one of the most
important developments because it has implications for the quality of the
descriptions that can be made from criterion-referenced test scores.
Objectives-referenced tests are being produced by many schools and
commercial test publishers, and these tests have value, but they do not
permit generalizations from the test scores. Since it is likely that
objectives-referenced tests will continue to be produced, it is important
for consumers to be familiar with both criterion-referenced tests and
objectives-referenced tests and the proper interpretations of scores
derived from each type of test. At this stage, there are only a few
good examples of domain specifications. These are available from James
Popham, Eva Baker and staff at the Center for the Study of Evaluation
at UCLA, and Richard Anderson and several of his colleagues at the
University of Illinois. Many more domain specifications are under
development at various sites around the country, and more will come
because the Basic Skills Group at the National Institute of Education
has specified the area as one of its priorities.

Two, the role and process of item analysis in test development
work seem substantially clearer now. Our review of emerging trends
in this area suggests that two types of information should be collected:
item ratings (obtained from any one of many possible formats) by content
specialists, and item statistics (of a wide variety of kinds) derived
from samples of examinee test item responses. Content specialists need
to address two basic questions. One, are the domain specifications clear
to potential users and item writers? Two, is the sample of items selected
for inclusion in a criterion-referenced test representative of the items
defined by a domain specification? On the other hand, item statistics
derived from examinee response data may be used to detect "flaws" (for
example, technical flaws in items, such as ambiguous wording). A key
point emerging from recent literature is that item statistics should
not usually be used in item selection since such a strategy introduces
a "bias" that could reduce the validity of scores derived from such a
test. The one important exception to the rule occurs when the single
purpose of a test is to produce scores to make mastery/non-mastery
decisions. A better test can be obtained if test items which discriminate
in the region of the desired cut-off score are selected.
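
The exception just described can be illustrated with a small sketch. It is
not a procedure taken from the literature cited in this unit. It assumes a
simple index chosen for illustration only: the difference between the
proportion of examinees answering an item correctly among those at or above
the cut-off score and among those below it. The function names and the data
are hypothetical, and an examinee's total score here includes the item being
examined, which a more careful analysis might avoid.

```python
def discrimination_at_cutoff(responses, cutoff):
    """Per-item difference in proportion correct between examinees whose
    proportion-correct total is at or above `cutoff` and those below it.

    `responses` is a list of examinee records, each a list of 0/1 item scores.
    This index is an illustrative assumption, not a prescribed statistic.
    """
    n_items = len(responses[0])
    upper = [r for r in responses if sum(r) / n_items >= cutoff]
    lower = [r for r in responses if sum(r) / n_items < cutoff]
    if not upper or not lower:
        raise ValueError("need examinees on both sides of the cut-off")
    index = []
    for j in range(n_items):
        p_upper = sum(r[j] for r in upper) / len(upper)
        p_lower = sum(r[j] for r in lower) / len(lower)
        index.append(p_upper - p_lower)
    return index


def select_items(responses, cutoff, k):
    """Return the positions of the k items with the largest index values."""
    index = discrimination_at_cutoff(responses, cutoff)
    ranked = sorted(range(len(index)), key=lambda j: index[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical data: six examinees, four items, cut-off of 75 percent correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
chosen = select_items(responses, cutoff=0.75, k=2)
```

With these data the second and third items separate the two groups most
sharply, while the fourth item is answered correctly equally often in both
groups and so contributes nothing to a mastery/non-mastery decision at this
cut-off score.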

Three,  a  significant  development  is  the  recognition  of  the  need 
for construct validation studies with criterion-referenced tests (Linn,
1977; Messick, 1975). The size of the test development project will
influence  the  scope  and  number  of  construct  validation  studies,  but 
clearly  more  work  is  needed  in  this  area  than  has  been  done  in  the 
past.    Experimental  studies,  factor  analyses,  and  investigations  of 
potential  sources  of  low  test  score  validity  represent  directions  for 
this  future  research.    The  limit  of  these  studies  will  be  the  level  of 
creativity  and  ingenuity  of  the  researchers  involved  (Hambleton,  1977b). 

Four, with respect to the technical topics of test length and
reliability, there are numerous useful contributions available. More
work seems to be needed on the assumptions underlying these technical
developments, but generally the work in these areas is sound.
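
As one illustration of how test length bears on mastery decisions, the
sketch below assumes a simple binomial model: an examinee with a given
domain score answers each of n randomly sampled items correctly with that
probability, independently. The model, the function name, and the numbers
are assumptions made for the example. The binomial assumption is exactly
the kind of underlying assumption that the paragraph above says needs
further study.

```python
from math import comb


def prob_pass(domain_score, n_items, min_correct):
    """Probability of at least `min_correct` right answers on `n_items`
    items for an examinee whose true domain score is `domain_score`,
    assuming independent binomial responses (an assumption of this sketch).
    """
    return sum(
        comb(n_items, x) * domain_score**x * (1 - domain_score) ** (n_items - x)
        for x in range(min_correct, n_items + 1)
    )


# An examinee whose domain score (0.70) is below an 80 percent standard.
short_test = prob_pass(0.70, n_items=5, min_correct=4)    # 4 of 5 required
long_test = prob_pass(0.70, n_items=20, min_correct=16)   # 16 of 20 required
```

Under these assumptions the examinee who falls short of the standard passes
the five-item test more than half the time, and passes the twenty-item test
considerably less often, which is the sense in which longer tests reduce
classification errors.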


The matter of determining cut-off scores seems less clear at this
time (see, for example, Glass, 1978). Aside from the concern about whether
cut-off scores should ever be used, at present there are few procedures
for sorting through the numerous approaches for determining cut-off scores
for the purpose of selecting one. Implementation strategies for nearly
all of the approaches are also lacking.

Five, numerous Bayesian statistical methods have been offered for
improving the precision of domain score estimation and for allocating
examinees to mastery states. The decision-theoretic procedure
outlined earlier provides a framework within which Bayesian statistical
methods can be employed with criterion-referenced tests. The incorporation
of losses introduces the decision maker's values into the decision
process. The Bayesian methods incorporate the prior knowledge of the
decision maker and utilize the data from all examinees, thereby effectively
increasing the amount of information the decision maker has without
requiring the administration of additional test items. There are a growing
number of impressive results to support continued activity in this area
(for example, Hambleton, Hutten, and Swaminathan, 1976; Novick and Jackson,
1974; Novick and Lewis, 1974). However, questions about the overall
gains that might accrue in view of the complexity of the procedures, the
robustness of the Bayesian models in testing situations where the
underlying assumptions of the model are not met (for example, when one has
very short tests), and the sensitivity of the Bayesian models to the
specification of priors, need to be addressed.
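
The roles of the prior and of the losses can be seen in a minimal sketch.
It is not the procedure of any of the authors cited above. It assumes a
binomial model for the test score and a beta prior with whole-number
parameters on the domain score (so that the posterior probability can be
computed exactly with a binomial sum), and it uses only one examinee's
data rather than the data from all examinees. The function names, the
prior, and the loss values are all hypothetical.

```python
from math import comb


def posterior_prob_mastery(n_correct, n_items, cutoff, prior_a=1, prior_b=1):
    """Posterior probability that the domain score is at or above `cutoff`,
    given a Beta(prior_a, prior_b) prior with positive integer parameters.
    """
    a = prior_a + n_correct              # posterior is Beta(a, b)
    b = prior_b + n_items - n_correct
    n = a + b - 1
    # For integer a and b: P(Beta(a, b) >= x) = P(Binomial(n, x) <= a - 1).
    return sum(
        comb(n, j) * cutoff**j * (1 - cutoff) ** (n - j) for j in range(a)
    )


def classify(n_correct, n_items, cutoff,
             loss_false_master=1.0, loss_false_nonmaster=1.0):
    """Choose the classification with the smaller posterior expected loss."""
    p = posterior_prob_mastery(n_correct, n_items, cutoff)
    expected_loss_if_master = loss_false_master * (1 - p)
    expected_loss_if_nonmaster = loss_false_nonmaster * p
    if expected_loss_if_master <= expected_loss_if_nonmaster:
        return "master"
    return "non-master"


# Eight of ten items correct, cut-off of 0.70, uniform prior.
p_mastery = posterior_prob_mastery(8, 10, 0.70)
equal_losses = classify(8, 10, 0.70)
costly_false_master = classify(8, 10, 0.70, loss_false_master=3.0)
```

In this example the same test score leads to different decisions depending
on the losses: with equal losses the examinee is classified as a master,
but when a false mastery decision is taken to be three times as costly as a
false non-mastery decision, the classification changes to non-master.
Changing the prior parameters in the sketch shows the sensitivity to the
specification of priors mentioned above.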

Six, a problem which has not been studied at all in the context of
criterion-referenced testing is an instance of the bandwidth-fidelity
dilemma (Cronbach and Gleser, 1965). When faced with making a number
of decisions of varying importance, and with a limited amount of testing
time available, how does a test developer go about determining the "best"
distribution of testing time? Does one try to collect considerable test
data to make the few most important decisions, or does one try to
distribute the available testing time in such a way as to collect a little
information relative to each decision? A solution to this problem is
required for an efficient testing program. Determination of test lengths
for each domain without regard for the size and scope of the total testing
program could produce a serious imbalance between testing and instructional
time.
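
The two extremes described above can be made concrete with a short sketch.
It is not offered as a solution to the problem, which remains open. It
simply assumes that each decision has been given a hypothetical importance
weight and divides a fixed number of test items among the decisions in
proportion to those weights, rounding by the largest-remainder rule. Equal
weights spread the items evenly; unequal weights concentrate them on the
most important decisions.

```python
def allocate_items(weights, total_items):
    """Divide `total_items` among decisions in proportion to `weights`.

    Proportional allocation with largest-remainder rounding; the weights
    are illustrative importance ratings, not quantities defined in this unit.
    """
    total_weight = sum(weights)
    exact = [total_items * w / total_weight for w in weights]
    counts = [int(x) for x in exact]          # round each share down
    leftover = total_items - sum(counts)
    # Hand the remaining items to the decisions with the largest remainders.
    by_remainder = sorted(
        range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Twenty-four items available for four decisions.
spread_evenly = allocate_items([1, 1, 1, 1], 24)
favor_first = allocate_items([5, 1, 1, 1], 24)
```

Whichever weights are used, the total stays fixed at the number of items the
available testing time allows, which is the constraint the paragraph above
asks test developers to keep in view.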


Seven, when a set of objectives can be arranged into a learning
hierarchy, the strategy of branched testing would seem to offer
considerable potential for decreasing the amount of testing while improving its
quality (Ferguson, 1969; Hambleton and Eignor, 1977; Spineti and Hambleton,
1977; Wood, 1973). Some of the practical problems have been resolved
in the Pittsburgh IPI Program so that the technique can now be used on a
limited basis. Nevertheless, many problems remain before adoption should
or can proceed on a large-scale basis. For example, it will be necessary
to develop a nonautomated modified version of branched testing for schools
without computers. Also, we need to know more about setting starting
places, step sizes, stopping rules, etc., before branched testing can be
used effectively.
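
The idea of branching through a hierarchy can be sketched as follows. This
is not the procedure used in the Pittsburgh IPI Program or in any of the
studies cited above. It assumes the simplest possible case: a strictly
linear hierarchy in which mastery of an objective implies mastery of every
objective below it, and mastery decisions that are made without error. The
starting place is the middle objective, the step size is halved after each
decision, and testing stops when the boundary is located. All names are
hypothetical.

```python
def locate_entry_level(n_objectives, is_mastered, start=None):
    """Find how many objectives, counted from the bottom of a linear
    hierarchy, the examinee has mastered.

    `is_mastered(i)` stands for administering a short test on objective i
    (0 is the lowest objective) and returning True or False.
    Returns the count of mastered objectives and the objectives tested.
    """
    low, high = 0, n_objectives    # objectives below `low` are mastered,
                                   # objectives from `high` up are not
    probe = n_objectives // 2 if start is None else start
    if n_objectives > 0 and not 0 <= probe < n_objectives:
        raise ValueError("starting place must be an objective in the hierarchy")
    tested = []
    while low < high:
        tested.append(probe)
        if is_mastered(probe):
            low = probe + 1        # branch upward
        else:
            high = probe           # branch downward
        probe = (low + high) // 2  # halve the remaining range
    return low, tested


# Hypothetical examinee who has mastered the five lowest of eight objectives.
level, tested = locate_entry_level(8, lambda i: i < 5)
```

In this example three objectives are tested rather than all eight. When
mastery decisions contain errors, or the hierarchy is not strictly linear,
the choice of starting place, step size, and stopping rule is no longer this
simple, which is why those questions are listed above as unresolved.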

Other matters requiring attention (offered without elaboration) are
techniques for reporting criterion-referenced test score information
(Ferguson and Novick, 1973; Millman, 1970); the use of norms with
criterion-referenced tests (Popham, 1976); applications of latent trait
models for the construction of criterion-referenced tests, and evaluations
and interpretations of these criterion-referenced test scores; and the
nature and scope of training programs for criterion-referenced test
developers and users (Hambleton, 1977a).

Consideration was given in our units to topics such as preparing
objectives, developing and validating tests, determining reliability,
setting cutting scores, and using criterion-referenced test scores.
We hope our materials will facilitate the continued development and
improvement of criterion-referenced testing. While our list of suggested
research and development activities above is not intended to be
comprehensive, the problem areas suggested above are among the more
important ones requiring resolution in the coming years. Our list should
be useful as a guide for directing some future work.

In conclusion, there are few criterion-referenced tests available
that can meet today's standards for test development, validation, and
usage. The good news is that the technology is now sufficiently
well-developed to improve this situation. It will be interesting to see what
happens.




10.2  References


Cronbach, L. J., & Gleser, G. C. Psychological tests and personnel
decisions. (2nd ed.) Urbana: University of Illinois Press,
1965.

Ferguson,  R.  L.    The  development,  implementation  and  evaluation  of  a 
computer-assisted branched test for a program of individually
prescribed  instruction.    Unpublished  doctoral  dissertation, 
University  of  Pittsburgh,  1969. 

Ferguson,  R.  L.,  &  Novick,  M.  R.    Implementation  of  a  Bayesian  system 
for  decision  analysis  in  a  program  of  individually  prescribed 
instruction.    ACT  Research  Report  No.  60.    Iowa  City,  Iowa: 
American  College  Testing  Program,  1973. 

Glass, G. V. Criteria and standards. Journal of Educational Measurement,
1978,  15,  217-261. 

Hambleton,  R.  K.    What  classroom  teachers  need  to  know  about  criterion- 
referenced  testing.    Laboratory  of  Psychometric  and  Evaluative 
Research  Report  No.  50.    Amherst,  Mass.:    School  of  Education, 
University of Massachusetts, 1977. (a)

Hambleton, R. K. Validation of criterion-referenced test score
interpretations. A paper presented at the Third International Symposium
on Educational Testing, University of Leyden, The Netherlands,
1977. (b)

Hambleton, R. K., & Eignor, D. R. Adaptive testing applied to
hierarchically structured objectives-based curricula. Proceedings
of the Second Computerized Adaptive Testing Conference, University
of Minnesota, 1977.

Hambleton, R. K., Hutten, L. R., & Swaminathan, H. A comparison of several
methods for assessing student mastery in objectives-based
instructional programs. Journal of Experimental Education, 1976, 45, 57-64.

Linn, R. L. Issues of validity in measurement for competency-based
programs. Paper presented at the annual meeting of the National
Council on Measurement in Education, New York, 1977.

Messick, S. The standard problem: Meaning and values in measurement
and evaluation. American Psychologist, 1975, 30, 955-966.

Millman, J. Reporting student progress: A case for a criterion-
referenced marking system. Phi Delta Kappan, 1970, 52, 226-230.

Millman, J. Passing scores and test lengths for domain-referenced
measures. Review of Educational Research, 1973, 43, 205-216.




Millman,  J.    Criterion-referenced  measurement.    In  W.  J.  Popham  (Ed.), 
Evaluation  in  education:    Current  applications.  Berkeley, 
California:    McCutchan  Publishing  Co. ,  1974. 

Novick,  M.  R. ,  &  Jackson,  P.  H.    Statistical  methods  for  educational 
and  psychological  research.    New  York:    McGraw-Hill,  1974. 

Novick,  M.  R. ,  &  Lewis,  C.    Prescribing  test  length  for  criterion- 
referenced  measurement.    In  C.  W.  Harris,  M.  C.  Alkin,  and  W.  J. 
Popham  (Eds.),  Problems  in  criterion-referenced  measurement. 
CSE  monograph  series  in  evaluation,  No.  3.    Los  Angeles:  Center 
for  the  Study  of  Evaluation,  University  of  California,  1974. 

Popham,  W.  J.    Normative  data  for  criterion-referenced  tests?    Phi  Delta 
Kappan, 1976, 58, 593-594.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, N.J.:
Prentice-Hall,  1978. 

Spinetti,  J.  P.,  &  Hambleton,  R.  K.    A  computer  simulation  study  of 
tailored  testing  strategies  for  objective-based  instructional 
programs. Educational and Psychological Measurement, 1977, 37,
139-158. 

Wood, R. Response-contingent testing. Review of Educational Research,
1973,  43,  529-544. 




